PR #450

open

Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)

by zachgoldfine44
val_bpb
1.1466
Architecture
Transformer
Optimizer
Muon
Artifact Size
14,385,363 bytes

Training Techniques

Architecture
Catalytic Residual Connections
Replace x + f(x) with x + c * f(x), where c is a learned per-dimension vector initialized to ones.
parameters: null
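A minimal NumPy sketch of the catalytic residual: the sub-block f below is a stand-in for an attention or MLP block, and the width of 4 is a toy size. Because c is initialized to ones, the block is exactly a standard residual at the start of training:

```python
import numpy as np

def catalytic_residual(x, f, c):
    # Standard residual is x + f(x); the catalytic variant scales the branch
    # output by a learned per-dimension vector c before adding it back.
    return x + c * f(x)

dim = 4                                # arbitrary toy width
c = np.ones(dim)                       # learned in training; ones at init
x = np.arange(dim, dtype=float)
f = lambda v: 2.0 * v                  # stand-in for an attention/MLP block
out = catalytic_residual(x, f, c)      # equals x + f(x) at initialization
```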
depth
Use a 12-layer Transformer stack.
parameters: {"layers":12}
BigramHash
Hash consecutive token pairs into a larger bigram embedding table and project to model dimension.
parameters: {"vocab_size":10240,"dim":128}
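A hedged sketch of the BigramHash idea: hash each (previous, current) token pair into one of 10240 buckets, look up a 128-dim embedding, and project to the model dimension. The hash multiplier, the pad id at position 0, and the model width of 512 are illustrative assumptions, not values stated in the PR:

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM, MODEL_DIM = 10240, 128, 512  # MODEL_DIM is assumed

def bigram_ids(tokens, n_buckets=N_BUCKETS):
    # Pair each token with its predecessor (pad id 0 at position 0) and
    # hash the pair into a bucket; the odd multiplier is an arbitrary choice.
    prev = np.concatenate(([0], tokens[:-1]))
    return (prev * 1000003 + tokens) % n_buckets

rng = np.random.default_rng(0)
table = rng.normal(size=(N_BUCKETS, BIGRAM_DIM))  # bigram embedding table
proj = rng.normal(size=(BIGRAM_DIM, MODEL_DIM))   # projection to model dim

tokens = np.array([5, 17, 17, 42])
bigram_emb = table[bigram_ids(tokens)] @ proj     # (seq_len, MODEL_DIM)
```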
XSA
Cross-sequence attention applied in the last 4 layers.
parameters: {"layers":4}
KV head count
Grouped-query attention with 4 KV heads and 8 attention heads.
parameters: {"heads":8,"kv_heads":4}
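A small NumPy sketch of grouped-query attention with the record's 8 query heads and 4 KV heads: each KV head is shared by 2 query heads, realized by repeating K/V before the usual attention math. Head dim and sequence length are arbitrary, and causal masking is omitted for brevity:

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (H, T, d); k, v: (H_kv, T, d) with H divisible by H_kv.
    H, T, d = q.shape
    h_kv = k.shape[0]
    k = np.repeat(k, H // h_kv, axis=0)  # each KV head serves H//H_kv query heads
    v = np.repeat(v, H // h_kv, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ v                            # (H, T, d)

rng = np.random.default_rng(0)
H, H_KV, T, D = 8, 4, 16, 32               # heads/kv_heads from the record
out = gqa_attention(rng.normal(size=(H, T, D)),
                    rng.normal(size=(H_KV, T, D)),
                    rng.normal(size=(H_KV, T, D)))
```

Sharing KV heads halves KV-cache size here while leaving the query-side capacity at 8 heads.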
MLP3x
MLP with 3x expansion and relu^2 activation.
parameters: {"hidden":1536}
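The MLP block as a sketch: 3x expansion with relu² activation. A hidden size of 1536 implies a model width of 512, which is an inference from the parameters rather than a stated value:

```python
import numpy as np

def relu2(x):
    # relu^2: square of the ReLU output
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # x: (T, D); w_in: (D, 3D); w_out: (3D, D)
    return relu2(x @ w_in) @ w_out

D = 512                                  # model width inferred from hidden=1536
rng = np.random.default_rng(0)
h = mlp3x(rng.normal(size=(4, D)),
          rng.normal(size=(D, 3 * D)) * 0.02,
          rng.normal(size=(3 * D, D)) * 0.02)
```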
Quantization
STE QAT
bits: 6
scope: all
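A sketch of the forward half of STE int6 QAT: symmetric fake quantization to 6 bits (integer levels in [-32, 31]). In training, the backward pass uses the straight-through estimator, i.e. the gradient of round() is treated as identity so gradients flow to the full-precision weights; only the forward is shown, and per-tensor scaling is an assumption:

```python
import numpy as np

def fake_quant_int6(w, eps=1e-12):
    # Symmetric per-tensor fake quantization to int6: map the largest |w|
    # to level 31, round to the integer grid, clip to [-32, 31], dequantize.
    scale = np.abs(w).max() / 31.0 + eps
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale                     # same shape/dtype as w ("fake" quant)

w = np.array([-1.0, -0.3, 0.0, 0.5, 1.0])
wq = fake_quant_int6(w)                  # used in the forward pass; STE backward
```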
Weight Averaging
SWA
parameters: {"start_fraction":0.8,"every_steps":50}
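The SWA schedule implied by the parameters, as a runnable sketch: snapshot the weights every 50 steps once 80% of training is done and keep a running mean. A 1-D array stands in for the model, and 1000 total steps is hypothetical:

```python
import numpy as np

TOTAL_STEPS = 1000                       # hypothetical run length
START = int(0.8 * TOTAL_STEPS)           # start_fraction = 0.8
EVERY = 50                               # every_steps = 50

swa_avg, n_snapshots = None, 0
for step in range(TOTAL_STEPS):
    weights = np.array([float(step)])    # stand-in for current model weights
    if step >= START and (step - START) % EVERY == 0:
        n_snapshots += 1
        if swa_avg is None:
            swa_avg = weights.copy()
        else:
            swa_avg += (weights - swa_avg) / n_snapshots  # incremental mean
```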
Optimizer
Muon
weight_decay: 0.042
momentum: 0.95
other_params: {"matrix_lr":0.04}
AdamW
weight_decay: 0.042
momentum: null
other_params: {"scope":"embeddings/scalars"}
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
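One common way to realize sliding-window eval with stride 64, sketched below: windows advance by the stride and each token's loss is computed exactly once, in the window that gives it the most left context. A window length of 2048 (matching train_length) is an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Returns (start, end, n_scored) spans: the model sees tokens [start:end]
    # and only the final n_scored positions contribute loss, so every token
    # is scored exactly once with as much left context as the window allows.
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(100, window=80, stride=20)  # toy sizes for illustration
```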
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
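A sketch of orthogonal initialization via QR, with a muP-style gain on output projections. Scaling the output-projection gain by 1/sqrt(width) is one common muP-flavored choice, assumed here rather than taken from the PR:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, seed=0):
    # Draw a Gaussian matrix, orthogonalize with QR, fix signs so the result
    # is deterministic, and apply the requested gain.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(rows, cols))
    tall = rows >= cols
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))
    return gain * (q if tall else q.T)

D = 512                                          # assumed model width
w_out = ortho_init(D, D, gain=1.0 / np.sqrt(D))  # muP-style output projection
w = ortho_init(8, 4)                             # columns are orthonormal
```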
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
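The layerwise LN scale, sketched: the LayerNorm gain of layer i (0-based) is set to 1/sqrt(i+1), so deeper layers contribute progressively less at initialization:

```python
import math

def ln_scale(layer_idx):
    # scale = 1 / sqrt(layer_idx + 1), per the record's regularization entry
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(12)]  # one scale per layer of the 12L stack
```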
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":4000,"warmup_steps":20}
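A sketch of the schedule implied by the parameters: 20 linear warmup steps, a constant plateau, then a linear "warmdown" to zero over the final 4000 iterations. The total step count of 10000 is hypothetical:

```python
def lr_mult(step, total_steps, warmup=20, warmdown=4000):
    # Multiplier applied to the base LR at `step` (0-based).
    if step < warmup:
        return (step + 1) / warmup                        # linear warmup
    if step >= total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)  # linear warmdown
    return 1.0                                            # constant plateau

TOTAL = 10000                                             # hypothetical
mults = [lr_mult(s, TOTAL) for s in (0, 19, 5000, 8000, TOTAL - 1)]
```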
Other
other
Late QAT: STE int6 quantization enabled only in the final portion of training, gated by a threshold of 0.25.
parameters: {"threshold":0.25}
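One plausible reading of the 0.25 threshold, sketched as a gate: STE int6 fake quantization turns on once the remaining fraction of training falls to 0.25 or below. This interpretation is an assumption; the PR only states the threshold value:

```python
def qat_active(step, total_steps, threshold=0.25):
    # Enable QAT late: remaining fraction of training <= threshold (assumed).
    return (total_steps - step) / total_steps <= threshold

flags = [qat_active(s, 1000) for s in (0, 500, 750, 900)]
```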

Novel Contributions

  • Catalytic residual connections with learned per-dimension residual scaling
  • 12-layer depth scaling as a sweet spot under the budget
  • BigramHash with 10240 buckets
  • Late QAT using STE int6 quantization
  • Stochastic Weight Averaging from the last 20% of warmdown