PR #79

open

Depth Recurrence: 3x3x1024 (non-record, pending H100)

by Marvbuster
val_bpb: 1.8698
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB

Training Techniques

Architecture
depth recurrence
3 unique transformer blocks are repeated 3 times for an effective depth of 9, reusing blocks across repeats without U-Net skip connections.
parameters: {"unique_blocks":3,"repeats":3,"effective_depth":9,"dim":1024}
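A minimal sketch of the recurrence, using a stand-in residual transform in place of a full transformer block (block internals and sizes other than dim=1024 are illustrative):

```python
import numpy as np

UNIQUE_BLOCKS, REPEATS, DIM = 3, 3, 1024

rng = np.random.default_rng(0)
# One weight matrix per unique block stands in for a transformer block.
blocks = [rng.standard_normal((DIM, DIM)) * 0.02 for _ in range(UNIQUE_BLOCKS)]

def block_forward(w, x):
    # Stand-in for a transformer block: residual + nonlinear transform.
    return x + np.tanh(x @ w)

def forward(x):
    applications = 0
    for _ in range(REPEATS):      # outer recurrence loop
        for w in blocks:          # the same 3 weight sets on every pass
            x = block_forward(w, x)
            applications += 1
    return x, applications

x = rng.standard_normal((4, DIM))
y, depth = forward(x)             # effective depth = 3 * 3 = 9
```

Only the 3 unique blocks carry parameters; the other 6 applications are free in terms of the artifact budget.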
tied embeddings
Input and output embeddings are tied to reduce parameters.
parameters: null
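Weight tying can be sketched as one shared matrix serving both the input lookup and the output head (the vocabulary size here is illustrative; only dim=1024 comes from the PR):

```python
import numpy as np

VOCAB, DIM = 1000, 1024     # VOCAB is an illustrative assumption
rng = np.random.default_rng(0)
embed = rng.standard_normal((VOCAB, DIM)) * 0.02   # single shared matrix

def embed_tokens(token_ids):
    return embed[token_ids]      # input side: row lookup

def lm_logits(hidden):
    return hidden @ embed.T      # output side: same matrix, transposed

h = embed_tokens(np.array([1, 2, 3]))
logits = lm_logits(h)
# One (VOCAB, DIM) matrix serves both roles instead of two.
```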
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":24,"kv_heads":12}
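A sketch of grouped-query attention with the PR's head counts; head dimension, sequence length, and the absence of batching are simplifying assumptions:

```python
import numpy as np

HEADS, KV_HEADS = 24, 12
GROUP = HEADS // KV_HEADS            # 2 query heads share each KV head
HEAD_DIM, SEQ = 64, 8                # illustrative sizes (assumptions)

rng = np.random.default_rng(0)
q = rng.standard_normal((HEADS, SEQ, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, SEQ, HEAD_DIM))
v = rng.standard_normal((KV_HEADS, SEQ, HEAD_DIM))

# Expand K/V so each group of query heads reads the same KV head.
k_full = np.repeat(k, GROUP, axis=0)          # (24, SEQ, HEAD_DIM)
v_full = np.repeat(v, GROUP, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)     # softmax over keys
out = weights @ v_full                        # (24, SEQ, HEAD_DIM)
```

The K/V projections and cache store 12 heads instead of 24, halving those parameter and memory costs.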
Quantization
QAT
bits: 6
scope: all
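The forward pass of 6-bit QAT can be sketched as symmetric fake quantization (the per-tensor symmetric scheme is an assumption; the PR only records 6 bits applied to all weights, and the straight-through estimator used in the backward pass is not shown):

```python
import numpy as np

BITS = 6
QMAX = 2 ** (BITS - 1) - 1           # symmetric int6 grid: [-32, 31]

def fake_quantize(w):
    # Round weights to the int6 grid, then rescale back to float.
    # During QAT, gradients bypass the rounding (straight-through).
    scale = np.abs(w).max() / QMAX
    return np.clip(np.round(w / scale), -QMAX - 1, QMAX) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
wq = fake_quantize(w)                # at most 2**6 = 64 distinct levels
```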
Optimizer
Muon
weight_decay: null
momentum: 0.85
other_params: {"matrix_lr":0.02,"muon_backend_steps":7,"qk_gain_init":2,"qk_gain":2}
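A hedged sketch of a Muon-style step with these hyperparameters: heavy-ball momentum (0.85) followed by an approximately orthogonalized update scaled by matrix_lr=0.02. The quintic Newton-Schulz coefficients follow the public Muon implementation, and steps=7 mirrors muon_backend_steps; the NorMuon variant's differences are not modeled here.

```python
import numpy as np

def newton_schulz(g, steps=7):
    # Quintic Newton-Schulz iteration that approximately orthogonalizes
    # a matrix (drives its singular values toward 1).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.85):
    # Momentum buffer update, then orthogonalized weight update.
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(0)
w, g = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
w_new, buf = muon_step(w, g, np.zeros_like(g))
```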
Compression
zlib
level: null
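Artifact packing with Python's standard zlib module; the compression level was not recorded in the PR, so level=9 below is purely an illustrative choice:

```python
import zlib

payload = b"\x00\x01" * 4096          # stand-in for checkpoint bytes
packed = zlib.compress(payload, level=9)   # level=9 is an assumption
restored = zlib.decompress(packed)    # lossless round trip
```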
Evaluation
sliding window eval
parameters: {"stride":64}
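One way to plan stride-64 sliding-window evaluation spans; the window length of 256 is an assumption for illustration, since the PR only records the stride:

```python
def sliding_windows(n_tokens, window=256, stride=64):
    # Each span covers [start, end) and only positions in
    # [scored_from, end) contribute to the loss, so every token is
    # scored exactly once with up to window - stride tokens of
    # left context.
    spans, pos = [], 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(n_tokens, pos + stride)
        spans.append((start, end, pos))   # (start, end, scored_from)
        pos = end
    return spans

spans = sliding_windows(1000)
```

This trades extra forward passes (each window recomputes its context) for better per-token conditioning than chunked evaluation.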
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
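A warmdown schedule can be sketched as a constant LR followed by a decay to zero over the final 3000 steps; the linear decay shape is an assumption, as the PR only records warmdown_steps:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3000):
    # 1.0 for most of training, then linear decay to 0 at the end.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return max(0.0, steps_left / warmdown_steps)

# e.g. with 10000 total steps, decay begins at step 7000
```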
Other
other
NorMuon training variant used alongside Int6 QAT.
parameters: null

Novel Contributions

  • Depth recurrence with 3 unique transformer blocks repeated 3 times
  • Trading architectural diversity (fewer unique blocks) for width, allowing a larger model dimension (1024) within the parameter budget
  • Int6 QAT to increase parameter capacity within the 16MB artifact budget
  • Use of NorMuon, which reportedly improved BPB
  • Sliding window evaluation with stride 64
  • Systematic search over multiple architectural strategies and hyperparameters