PR #1407

open

Non-record: Extended Compute Scaling Analysis (20K steps, 1.0960 BPB, 3 seeds; each run ~6 hours on 4xA100 MIG)

by OnlyJundongView on GitHub
val_bpb: 1.0960
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.05MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
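A minimal sketch of the "LeakyReLU squared" activation, assuming the usual small negative slope (the PR does not state the slope value):

```python
def leaky_relu_squared(x, negative_slope=0.01):
    # LeakyReLU followed by squaring; the 0.01 slope is an assumption.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

In the MLP this would replace the usual ReLU/GELU between the two projections.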
BigramHash
Bigram hash embedding component.
parameters: {"size":1536}
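A hedged sketch of how a bigram hash embedding can index a table of the stated size; the hash constant here is illustrative, not the PR's actual mixing function:

```python
def bigram_hash_index(prev_token, token, size=1536):
    # Mix the two adjacent token ids into a bucket of the 1536-entry
    # embedding table; the multiplier is a placeholder, not the PR's hash.
    return (prev_token * 1000003 + token) % size
```

The looked-up embedding would then be added alongside the regular token embedding.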
XSA
Applies XSA in the last 4 layers.
parameters: {"layers":4}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
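A sketch of partial rotary embeddings under these parameters: only the first 16 of the 64 head dimensions are rotated, the rest pass through unchanged (the pairing convention and base are assumptions):

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` dims of a 64-dim head vector;
    # dims beyond rot_dims are left untouched.
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        theta = position / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + half]
        out[i] = x0 * c - x1 * s
        out[i + half] = x0 * s + x1 * c
    return out
```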
VE128
Value embedding / residual enhancement module.
parameters: {"layers":[9,10]}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
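The LN scale rule above as a one-liner, assuming 0-indexed layers:

```python
import math

def ln_scale(layer_index):
    # Down-weights later layers' LayerNorm gain: 1/sqrt(layer+1),
    # matching the "scale" parameter (0-indexed layers assumed).
    return 1.0 / math.sqrt(layer_index + 1)
```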
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
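A sketch of how the two averages could be maintained together: EMA updated every step with the stated decay, SWA snapshots taken every 50 steps ("tight" is read here as a frequent snapshot cadence; that interpretation is an assumption):

```python
def update_averages(step, params, ema, swa_sum, swa_count,
                    ema_decay=0.997, swa_every=50):
    # EMA: exponential moving average of weights, updated every step.
    for i, p in enumerate(params):
        ema[i] = ema_decay * ema[i] + (1.0 - ema_decay) * p
    # SWA: plain running mean of periodic snapshots.
    if step % swa_every == 0:
        for i, p in enumerate(params):
            swa_sum[i] += p
        swa_count += 1
    return swa_count

def swa_average(swa_sum, swa_count):
    return [s / swa_count for s in swa_sum]
```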
Quantization
GPTQ-lite
bits: 6
scope: model + code
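For illustration of the 6-bit storage format only: a plain round-to-nearest symmetric quantizer. This is NOT GPTQ-lite (which calibrates rounding with second-order error information); it only shows what 6-bit integer weights plus a scale look like:

```python
def quantize6(weights):
    # Symmetric 6-bit: integers in [-31, 31] plus one float scale.
    qmax = 2 ** 5 - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize6(q, scale):
    return [v * scale for v in q]
```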
Compression
lzma
level: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
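A toy skeleton of the full-TTT loop implied by these parameters: the evaluation text is split into chunks of `chunk_tokens`, every block is updated (freeze_blocks=0) for 3 epochs with momentum SGD and gradient clipping. `grads_fn` stands in for a real backward pass and the scalar `param` for the model weights; both are hypothetical simplifications:

```python
def ttt_adapt(param, grads_fn, tokens, learning_rate=0.002, epochs=3,
              chunk_tokens=32768, momentum=0.9, grad_clip=1.0):
    # Single-parameter sketch; a real run would step all unfrozen
    # blocks with batches of `batch_seqs` sequences per chunk.
    velocity = 0.0
    chunks = [tokens[i:i + chunk_tokens]
              for i in range(0, len(tokens), chunk_tokens)]
    for _ in range(epochs):
        for chunk in chunks:
            g = grads_fn(param, chunk)
            g = max(-grad_clip, min(grad_clip, g))  # grad clipping
            velocity = momentum * velocity + g
            param -= learning_rate * velocity
    return param
```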
LR Schedule
warmdown
parameters: {"iterations":20000,"warmdown_iters":7800,"muon_momentum_warmup_steps":3340}
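The warmdown schedule as a multiplier on the base LR: constant, then decaying to zero over the final 7,800 of 20,000 iterations (a linear decay shape is assumed; the PR only names the schedule "warmdown"):

```python
def lr_multiplier(step, iterations=20000, warmdown_iters=7800):
    # Constant LR, then linear decay to 0 over the last warmdown_iters.
    start = iterations - warmdown_iters
    if step < start:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)
```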
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":3340,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
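The momentum warmup implied by these settings: ramp from 0.92 to the final 0.99 over the first 3,340 steps (linear interpolation is an assumption):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=3340):
    # Linear ramp of Muon's momentum during early training.
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```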

Novel Contributions

  • Extended the PR #549 method to 20K training steps under unlimited compute.
  • Showed that post-TTT BPB improves to 1.0960 on a 3-seed mean.
  • Demonstrated non-monotonic artifact size behavior: mid-training checkpoints exceed 16MB, but the final warmdown-completed model fits at about 15.05MB.
  • Analyzed scaling behavior across training steps, showing rapid early BPB gains followed by a warmdown-driven final drop.
  • Confirmed that TTT gains scale with compute, with TTT reducing BPB by 0.006 at 20K steps.