PR #317 (open)

Record: 11L XSA4 + EMA + TTT + Int6 MLP3x (val_bpb=1.1442)

by chris-buckley
val_bpb: 1.1442
Architecture: Transformer
Optimizer: Muon/AdamW
Artifact Size: under 16 MB

Training Techniques

Architecture
XSA
XSA applied to the last 4 layers
parameters: {"layers":4}
MLP3x
3x MLP width
parameters: null
SmearGate
Uses SmearGate in the model stack
parameters: null
BigramHash
Uses BigramHash auxiliary component with vocabulary size 2048
parameters: {"vocab_size":2048}
KV head count
Uses 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
Quantization
int6 mixed
bits: 6
scope: all
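The record lists int6 mixed quantization over all weights but does not spell out the scheme. A minimal sketch, assuming symmetric per-tensor quantization (scale chosen so the largest weight maps to the int6 extreme; the actual mixed-precision grouping is not stated):

```python
def quantize_int6(weights):
    # Symmetric per-tensor int6: map weights into the signed 6-bit
    # range [-32, 31] using a single scale (illustrative assumption;
    # the PR's exact int6-mixed scheme is not specified in the record).
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    # Recover approximate float weights from int6 codes.
    return [v * scale for v in q]

w = [0.5, -0.25, 0.031, -1.0]
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

TTT later runs on the dequantized checkpoint, i.e. on `w_hat`, not on the int6 codes.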
Weight Averaging
EMA
parameters: null
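EMA keeps a shadow copy of the weights updated as an exponential moving average after each optimizer step. A minimal sketch (the decay value below is an assumption; the record gives no EMA parameters):

```python
def ema_update(ema, params, decay=0.999):
    # Exponential moving average of parameters, updated in place.
    # decay=0.999 is an assumed default; the record does not state it.
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

# Usage: shadow weights start as a copy of the model weights.
params = {"w": 1.0}
ema = dict(params)
for step in range(3):
    params["w"] += 1.0                      # stand-in for an optimizer step
    ema_update(ema, params, decay=0.5)      # large decay gap for illustration
```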
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw_used":true}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Initialization
OrthoInit
Orthogonal initialization with muP-style output scaling
Evaluation
stride-based sliding window eval
parameters: {"stride":64}
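In stride-based sliding-window evaluation, the model is run on overlapping windows but loss is taken only on the last `stride` tokens of each window (the first window scores everything), so every scored token sees a long left context. A sketch of the window planning, assuming an eval context equal to the train length of 2048 (the record leaves eval_length null):

```python
def sliding_window_spans(n_tokens, context=2048, stride=64):
    # Return (window_start, window_end, score_start) triples: each
    # window covers up to `context` tokens, and only positions
    # [score_start, window_end) contribute to the loss, so every
    # token is scored exactly once with maximal left context.
    end = min(context, n_tokens)
    spans = [(0, end, 0)]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - context), new_end, end))
        end = new_end
    return spans

# Small numbers for illustration: 12 tokens, context 8, stride 2.
spans = sliding_window_spans(12, context=8, stride=2)
```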
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"momentum":0.9,"freeze_blocks":2}
Compression
zstd
level: 22
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
fixed learning rates
parameters: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}

Novel Contributions

  • Adds full-model SGD test-time training on the dequantized checkpoint
  • Uses EMA instead of SWA in the winning public training stack
  • Applies XSA to the last 4 layers
  • Uses stride-64 evaluation
  • Tunes learning rates upward for matrix, scalar, and tied embedding parameters
  • Includes compatibility fallbacks: FlashAttention-3 (FA3) to SDPA, and a manual KV-head repeat for GQA
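The GQA fallback in the last bullet can be sketched as a plain repeat: with 8 query heads and 4 KV heads (per the record), each KV head is duplicated so an attention kernel that expects one KV head per query head still works. Minimal illustration with labels standing in for per-head tensors:

```python
def repeat_kv_heads(kv, n_heads=8, n_kv_heads=4):
    # Manual GQA fallback: repeat each KV head n_rep times so the
    # K/V layout matches the 8 query heads when the kernel has no
    # native grouped-query support. `kv` holds one entry per KV head
    # (labels here; real code would repeat along the head dimension).
    assert n_heads % n_kv_heads == 0 and len(kv) == n_kv_heads
    n_rep = n_heads // n_kv_heads
    return [head for head in kv for _ in range(n_rep)]

kv_heads = ["kv0", "kv1", "kv2", "kv3"]
expanded = repeat_kv_heads(kv_heads)
# -> ["kv0", "kv0", "kv1", "kv1", "kv2", "kv2", "kv3", "kv3"]
```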