PR #2009

open

Record: DepthShare4096 + SparseAttnGate + Muon TTT - val_bpb 1.0500312

val_bpb: 1.0500
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,921,334 bytes

Training Techniques

Architecture
depth recurrence
8 base layers are reused for 3 recurrent passes, giving an effective 24-layer depth with weight tying.
parameters: {"layers":8,"recurrent_passes":3,"effective_depth":24}
weight tying
Input/output embeddings are tied and recurrent blocks share weights across passes.
parameters: null
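Embedding/output tying is the standard trick of pointing the LM head's weight at the token-embedding matrix; a minimal sketch (module names are illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 4096, 512
wte = nn.Embedding(vocab_size, d_model)               # input token embeddings
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection
lm_head.weight = wte.weight                           # tie: one shared parameter tensor
```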
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"n_head":8,"n_kv_head":2}
Partial RoPE
Applies rotary embeddings to only part of the head dimensions.
parameters: {"rotary_pct":0.5}
SparseAttnGate
Learned per-head gating sparsifies attention weights below a threshold.
parameters: null
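No parameters are listed for SparseAttnGate, so the mechanics below are an assumption based only on the one-line description: a learned per-head threshold zeroes small post-softmax attention weights, and the surviving weights are renormalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAttnGate(nn.Module):
    """Hypothetical sketch: learned per-head threshold; weights below it become zero."""
    def __init__(self, n_head, init_threshold=0.01):
        super().__init__()
        # one learned threshold per head, kept positive via softplus (an assumption)
        self.raw_threshold = nn.Parameter(torch.full((n_head, 1, 1), -4.6))  # softplus(-4.6) ~ 0.01

    def forward(self, attn):                      # attn: (B, n_head, T, T), post-softmax
        thresh = F.softplus(self.raw_threshold)
        gated = torch.relu(attn - thresh)         # weights below the threshold become exactly 0
        return gated / gated.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize rows

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
sparse_attn = SparseAttnGate(n_head=8)(attn)
```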
Test-Time Training
TTT
Backward-only test-time adaptation of the layer-norm parameters during evaluation.
parameters: {"mode":"backward-only","adaptation_target":"layer norms"}
Compression
zlib
level: null
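The artifact size above is presumably measured on a compressed checkpoint; a sketch of one way to take that measurement with zlib (the model and compression level here are placeholders, since the PR leaves the level unspecified):

```python
import io
import zlib
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                           # stand-in for the real checkpointed model
buf = io.BytesIO()
torch.save(model.state_dict(), buf)               # serialize weights to a byte stream
compressed = zlib.compress(buf.getvalue())        # default level; the PR lists level: null
print(f"artifact size: {len(compressed):,} bytes")
```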
Optimizer
Muon
weight_decay: 0.01
momentum: 0.95
other_params: {"nesterov":true,"ns_steps":6,"lr":0.0095,"warmup_steps":200,"schedule":"cosine decay"}
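Muon orthogonalizes each 2-D weight's momentum update with a few Newton-Schulz iterations before applying it. A condensed sketch of one update using the listed hyperparameters (the quintic coefficients follow the commonly published Muon iteration and may not match the PR's exact code):

```python
import torch

def newton_schulz(G, steps=6, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(p, grad, buf, lr=0.0095, momentum=0.95, weight_decay=0.01, nesterov=True):
    """One Muon update for a 2-D parameter p; buf is its momentum buffer (updated in place)."""
    buf.mul_(momentum).add_(grad)
    update = grad.add(buf, alpha=momentum) if nesterov else buf
    update = newton_schulz(update, steps=6)
    p.mul_(1 - lr * weight_decay)                 # decoupled weight decay
    p.add_(update, alpha=-lr)
```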
LR Schedule
cosine decay
parameters: {"warmup_steps":200,"final_lr_multiplier":0.1}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
weight decay
parameters: {"value":0.01}

Novel Contributions

  • DepthShare-4096 depth-recurrent transformer with 8 layers reused for 3 passes
  • 4096-token BPE tokenizer to improve bits-per-byte (val_bpb) performance; see the bits-per-byte sketch after this list
  • SparseAttnGate attention sparsification
  • Partial RoPE with rotary_pct=0.5
  • Muon optimizer with Nesterov momentum 0.95 and 6 Newton-Schulz iterations per update
  • Backward-only test-time training on layer norms
  • 3-seed statistically significant improvement over prior SOTA
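For reference, a sketch of how bits-per-byte can be computed from token-level cross-entropy regardless of tokenizer vocabulary size (the numbers in the example call are arbitrary placeholders, not the PR's measurements):

```python
import math

def bits_per_byte(total_nll_nats, total_utf8_bytes):
    """Convert a summed next-token NLL (in nats) over a split into bits per byte."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# arbitrary placeholder values, not the PR's numbers
print(bits_per_byte(total_nll_nats=7.3e6, total_utf8_bytes=1.0e7))
```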