PR #579
The Frugendorff: Recursive Weight Sharing for Transformer Compression (1.1478 BPB, 15.19 MB)
by newjordan
val_bpb
1.1355
Architecture
Transformer
Optimizer
Muon (matrices) and AdamW (embeddings and scalars)
Artifact Size
15.19 MB
Training Techniques
Quantization
int6 per-row with GPTQ Hessian-aware quantization
bits: 6
scope: MLP and attention weights
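A minimal sketch of the per-row int6 scheme, using plain round-to-nearest; the PR's GPTQ variant additionally compensates rounding error with a Hessian estimate, which is not shown here.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Per-row symmetric int6 quantization: one scale per output row.

    Round-to-nearest baseline only; GPTQ's Hessian-aware error
    compensation (as used in the PR) is omitted.
    """
    # symmetric int6 range [-31, 31] (drop -32 to keep the grid symmetric)
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.maximum(scale, 1e-12)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)
```

The same fake-quantize round trip (`dequantize(quantize(...))`) is what the late-QAT stage described further down would apply during training.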
Architecture
recursive weight sharing
K unique transformer blocks applied N times in sequence to produce deeper effective networks from fewer stored parameters
parameters: {"unique_blocks":6,"loops":2,"effective_depth":12,"MLP_expansion":"4x"}
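The looping itself is simple; a toy sketch with matrices standing in for full transformer blocks (6 unique blocks, 2 loops, effective depth 12, per the config above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, unique_blocks, loops = 16, 6, 2   # PR config: 6 unique blocks, 2 loops

# One weight matrix per unique block (a stand-in for a full transformer block).
blocks = [rng.normal(scale=0.02, size=(d, d)) for _ in range(unique_blocks)]

def block_forward(x, w):
    h = x @ w
    return x + np.maximum(h, 0.0) ** 2   # toy residual body with relu^2

def forward(x):
    applications = 0
    for _ in range(loops):               # reuse the same blocks each loop
        for w in blocks:
            x = block_forward(x, w)
            applications += 1
    return x, applications

x0 = rng.normal(size=(2, d))
y, effective_depth = forward(x0)
stored_params = unique_blocks * d * d    # only 6 blocks are stored
effective_params = effective_depth * d * d
```

Storage is cut by exactly the loop factor: 12 block applications, 6 blocks' worth of weights.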
asymmetric weight sharing (Micro Crawler)
4 unique flat blocks run once, then 2 shared crawler blocks run twice with orthogonal positions to isolate gradient conflict
parameters: {"flat_blocks":4,"crawler_blocks":2,"crawler_loops":2,"effective_depth":8}
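The asymmetric layout reduces to a block-application schedule; indices below are illustrative, not taken from the PR's code:

```python
# Block-application schedule for the asymmetric "Micro Crawler" layout:
# 4 unique flat blocks run once, then 2 crawler blocks looped twice.
flat_blocks = [0, 1, 2, 3]
crawler_blocks = [4, 5]
crawler_loops = 2

schedule = list(flat_blocks)
for _ in range(crawler_loops):
    schedule.extend(crawler_blocks)
```

Only the two crawler blocks see gradients from multiple positions in the network, which is the claimed isolation of gradient conflict.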
bidirectional persistent deliberation gate
Learned consensus parameter with bidirectional gradient flow between recursive firings to improve communication and model quality
parameters: null
MLP expansion
4x MLP expansion enabled by parameter savings from weight sharing
parameters: {"hidden_dim":2560,"activation":"relu-squared"}
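A sketch of the MLP with relu-squared activation; the model width of 640 is inferred from hidden_dim 2560 / 4x expansion, not stated directly in the PR:

```python
import numpy as np

d_model, d_hidden = 640, 2560   # 4x expansion; 640 inferred from 2560 / 4
rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.02, size=(d_model, d_hidden))
w_out = rng.normal(scale=0.02, size=(d_hidden, d_model))

def mlp(x):
    h = np.maximum(x @ w_in, 0.0) ** 2   # relu-squared activation
    return h @ w_out

y = mlp(rng.normal(size=(3, d_model)))
```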
attention
GQA with 10 heads and 5 KV heads, XSA on last 2 blocks
parameters: {"num_heads":10,"num_kv_heads":5,"XSA_layers":2}
input conditioning
BigramHash (2048 buckets) for Frugendorff Squared; TrigramHash (8192 buckets, 3 orthogonal hash primes) for Micro Crawler
parameters: null
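A sketch of n-gram hashing into embedding buckets; the prime constants below are placeholders, since the PR does not list its actual hash primes:

```python
# Hash recent token n-grams into a fixed number of embedding buckets.
BIGRAM_BUCKETS = 2048
TRIGRAM_BUCKETS = 8192
PRIMES = (10_007, 10_009, 10_037)   # 3 "orthogonal" hash primes (assumed values)

def bigram_bucket(t_prev: int, t_cur: int) -> int:
    return (t_prev * PRIMES[0] + t_cur) % BIGRAM_BUCKETS

def trigram_buckets(t2: int, t1: int, t0: int) -> list:
    # one bucket per prime, so hash collisions differ across the three tables
    return [(t2 * p * p + t1 * p + t0) % TRIGRAM_BUCKETS for p in PRIMES]
```

Three independently hashed trigram tables mean a collision in one table is usually disambiguated by the other two.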
position embeddings
QR-initialized orthogonal vectors, one per loop iteration
parameters: null
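QR-initialized per-loop position vectors can be sketched directly: the Q factor of a random Gaussian matrix has orthonormal columns, one taken per loop iteration.

```python
import numpy as np

d, loops = 16, 2   # toy width; one position vector per loop iteration
rng = np.random.default_rng(0)

# QR of a random Gaussian matrix yields orthonormal columns.
q_mat, _ = np.linalg.qr(rng.normal(size=(d, loops)))
loop_pos = q_mat.T      # loop_pos[i] is the position vector for loop i
```

Orthogonality lets the shared blocks distinguish which loop iteration they are executing without the position signals interfering.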
U-Net skip connections
Within each loop iteration
parameters: null
Optimizer
Muon (matrices) and AdamW (embeddings and scalars)
weight_decay: null
momentum: 0.99
other_params: {"Muon_lr":0.025,"AdamW_embeddings_lr":0.035,"AdamW_scalars_lr":0.025,"gradient_clip":0.3}
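At Muon's core is an approximate orthogonalization of each gradient matrix via a quintic Newton-Schulz iteration; the coefficients below are the ones used in public Muon implementations, and momentum plus the actual parameter update are omitted.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix (the core of Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

g = np.random.default_rng(0).normal(size=(16, 16))
o = newton_schulz_orthogonalize(g)
```

The iteration pushes all singular values toward 1, so the update direction depends on the gradient's row/column spaces rather than its magnitude.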
Weight Averaging
SWA and EMA
parameters: {"SWA_frequency":"every 50 steps when scale < 0.2","EMA_decay":0.997,"EMA_applied_after_distillation":true}
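The EMA half of the scheme is a one-line update with decay 0.997; SWA's trigger condition and averaging window follow the description above and are not sketched here.

```python
# Exponential moving average of weights, applied per parameter.
def ema_update(ema: float, w: float, decay: float = 0.997) -> float:
    return decay * ema + (1.0 - decay) * w

ema, w = 0.0, 1.0
for _ in range(1000):          # EMA converges toward the tracked weight
    ema = ema_update(ema, w)
```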
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
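With stride 64, each window scores only its final 64 tokens so every token is evaluated exactly once with long left context; the window length below is an assumed value, since the PR states only the stride.

```python
def eval_spans(n_tokens: int, window: int = 256, stride: int = 64):
    """Return (ctx_start, score_start, score_end) triples covering all tokens."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)   # re-read earlier context
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans

spans = eval_spans(1000)
```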
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
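A warmdown schedule holds the LR constant, then decays linearly to zero over the final 3500 steps; the total step count below is illustrative, as the PR does not state it here. This LR scale is also what the "scale < 0.2" / "scale < 0.15" triggers elsewhere in the entry refer to.

```python
def lr_scale(step: int, total_steps: int = 10_000,
             warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```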
Regularization
layerwise LN scale
parameters: {"scale_factor":"1/sqrt(layer_idx+1)"}
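The layerwise scale factor is a direct formula: deeper layers contribute progressively smaller normalized outputs, damping residual-stream growth.

```python
import math

def ln_scale(layer_idx: int) -> float:
    """Per-layer LayerNorm output scale: 1 / sqrt(layer_idx + 1)."""
    return 1.0 / math.sqrt(layer_idx + 1)
```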
Other
other
Late QAT: int6 fake-quantization applied when learning rate scale < 0.15
parameters: null
other
Late Training Replay: 2-epoch replay of last 100 training batches at 10% learning rate
parameters: null
other
Self-distillation with EMA teacher, 50 steps, temperature=2.0, alpha=0.7
parameters: null
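The self-distillation objective is the standard temperature-scaled mix of KL-to-teacher and hard-label cross-entropy, with the PR's t=2.0 and alpha=0.7; this numpy sketch assumes that standard form, as the PR gives only the hyperparameters.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, targets, t=2.0, alpha=0.7):
    """alpha-weighted mix of KL to the (EMA) teacher at temperature t
    and hard-label cross-entropy; KL is scaled by t^2 as is standard."""
    p_t = softmax(teacher_logits, t)
    log_p_s = np.log(softmax(student_logits, t) + 1e-12)
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1)) * t * t
    probs = softmax(student_logits)
    ce = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))
    return alpha * kl + (1.0 - alpha) * ce

rng = np.random.default_rng(0)
logits_s = rng.normal(size=(4, 10))
logits_t = rng.normal(size=(4, 10))
loss = distill_loss(logits_s, logits_t, np.array([1, 2, 3, 4]))
```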
Novel Contributions
- Recursive weight sharing architecture applying K unique transformer blocks N times to create deeper effective networks with fewer parameters
- Asymmetric weight sharing (Micro Crawler) isolating gradient conflict to fewer blocks to improve quality and quantization robustness
- Bidirectional persistent deliberation gate enabling communication between recursive firings with gradient flow in both directions
- Reinvestment of saved parameter budget into wider 4x MLP layers enabled by fractal weight sharing
- Demonstration that training-step count matters more than effective depth, since recursive sharing makes each step faster and allows more of them in the same budget
- Use of GPTQ Hessian-aware quantization to significantly reduce quantization gap for shared weights