PR #499

closed

The Frugendorff: Recursive Weight Sharing + MLP 4x (1.1478 BPB, 15.19MB)

by newjordan
val_bpb
1.1478
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.19MB

Training Techniques

Architecture
depth recurrence / recursive weight sharing
6 unique transformer blocks are each applied twice in sequence, yielding 12 effective layers from shared parameters.
parameters: {"unique_blocks":6,"loops_per_block":2,"effective_depth":12}
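A minimal sketch of the recurrence, assuming each unique block is looped twice consecutively before moving to the next (the exact iteration order is not shown in the PR; names are illustrative):

```python
def recursive_forward(x, blocks, loops_per_block=2):
    # 6 unique blocks, each applied loops_per_block times:
    # effective depth = 6 * 2 = 12 layers from 6 blocks' parameters.
    for block in blocks:
        for _ in range(loops_per_block):
            x = block(x)
    return x
```

The parameter savings from sharing 6 blocks across 12 effective layers are what fund the 4x MLP expansion below.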
MLP 4x
Expanded feed-forward network to 4x width (hidden size 2560) using freed parameter budget.
parameters: {"mlp_multiplier":4,"hidden_size":2560}
Partial RoPE
Uses partial rotary position embeddings on a subset of dimensions with NTK-aware scaling.
parameters: {"rope_dims":16,"total_dims":64}
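A sketch of the partial rotation: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The interleaved even/odd pairing and the NTK base-rescaling formula are common conventions, assumed here rather than taken from the PR:

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16, base=10000.0, ntk_factor=1.0):
    # Rotate only the first rope_dims entries of the head vector x;
    # dimensions [rope_dims:] are left untouched.
    d = rope_dims
    base = base * ntk_factor ** (d / (d - 2))   # NTK-aware base rescale (assumed form)
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)
    ang = pos * inv_freq                        # one angle per rotated pair
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:d:2], x[1:d:2]
    out = x.copy()
    out[:d:2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    out[1:d:2] = x1 * sin + x2 * cos
    return out
```

Because rotation is norm-preserving, the untouched dimensions can carry position-independent content features.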
SmearGate
Custom gating mechanism used in the model.
parameters: null
BigramHash
Hash-based bigram feature module with shared embeddings.
parameters: {"buckets":2048,"dimension":128}
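A minimal sketch of hashing a (previous, current) token pair into one of the 2048 bucket rows of the shared 128-dim embedding table; the mixing constants are illustrative, not from the PR:

```python
def bigram_bucket(prev_token, token, buckets=2048):
    # Mix the token pair into a 32-bit hash, then reduce to a bucket index.
    # Collisions are expected and tolerated: the module only needs to give
    # frequent bigrams a roughly distinct 128-dim embedding row.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```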
tied embeddings
Input and output embeddings are shared.
parameters: null
XSA
XSA applied on the last 2 unique blocks.
parameters: {"last_blocks":2}
Initialization
QR-initialized orthogonal loop position embeddings
Orthogonal loop position embeddings initialized with QR decomposition.
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"scope":"matrices","lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings and scalars","embedding_lr":0.035,"scalar_lr":0.025}
Weight Averaging
SWA
parameters: {"frequency":"every 50 steps when scale < 0.2"}
EMA
parameters: {"decay":0.997}
Quantization
int6 QAT
bits: 6
scope: MLP and attention weights
int8
bits: 8
scope: embeddings
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Other
other
Replays the last 100 training batches for 2 additional epochs at the end of training.
parameters: {"epochs":2,"batches":100}
other
Self-distillation using an EMA teacher for 50 steps.
parameters: {"teacher":"EMA","steps":50,"temperature":2,"alpha":0.7}

Novel Contributions

  • Recursive weight sharing / fractal looping of 6 unique transformer blocks to create 12 effective layers
  • Reinvesting parameter savings into a 4x MLP expansion
  • Orthogonal loop position embeddings to distinguish repeated passes through shared blocks
  • U-Net skip connections within each loop iteration
  • Combination of SmearGate, BigramHash, shared value embeddings, and XSA in a compact transformer
  • Late QAT, training replay, self-distillation, SWA, and EMA used together under the artifact budget