PR #1424

open

Non-record: Extended Compute Scaling Analysis (50K steps, 1.0858 BPB, 3 seeds; each run ~12 hours on 4xA100 MIG)

by OnlyJundongView on GitHub
val_bpb: 1.0858
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.30 MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
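The squared-LeakyReLU activation has a simple pointwise form. A minimal sketch; the negative slope value is an assumption, since the submission does not state it:

```python
def leaky_relu_squared(x, negative_slope=0.01):
    # Square of LeakyReLU: negative inputs are scaled by the slope
    # before squaring, so outputs are always non-negative.
    y = x if x >= 0 else negative_slope * x
    return y * y
```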
BigramHash
Bigram hash embedding component.
parameters: {"size":1536}
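A bigram hash embedding maps each consecutive token pair to a row of a fixed-size table (1536 here). The mixing constants and BOS placeholder below are illustrative, not taken from the submission:

```python
def bigram_hash_ids(tokens, table_size=1536, bos=0):
    # Hash each (previous token, current token) pair into an index
    # of a shared embedding table of `table_size` rows.
    ids = []
    prev = bos
    for tok in tokens:
        ids.append(((prev * 1000003) ^ tok) % table_size)
        prev = tok
    return ids
```

The resulting ids would index an embedding matrix whose rows are added to (or concatenated with) the token embeddings.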
XSA
Applies XSA to the last 4 layers.
parameters: {"last_n_layers":4}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
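Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the rest through unchanged. A sketch for a single head vector; the half-split pairing convention and the base of 10000 are assumptions:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate only the first `rot_dims` entries of the head dimension;
    # the remaining dimensions pass through unchanged.
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])
```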
VE128
Value embedding / residual component with dimension 128.
parameters: {"dimension":128,"layers":[9,10]}
Regularization
LN scale
LayerNorm gain scaled down with depth.
parameters: {"scale":"1/sqrt(layer+1)"}
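The listed 1/sqrt(layer+1) rule is direct to compute; whether it is applied as an initialization or a fixed multiplier is not stated, so this sketch only produces the factor:

```python
import math

def ln_scale(layer_index):
    # Depth-dependent LayerNorm gain: layer i gets 1/sqrt(i+1),
    # damping deeper layers' contribution to the residual stream.
    return 1.0 / math.sqrt(layer_index + 1)
```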
Weight Averaging
EMA + Tight SWA
Exponential moving average combined with short-interval stochastic weight averaging.
parameters: {"ema_decay":0.997,"swa_every":50}
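One plausible reading of the listed parameters: an EMA updated every step with decay 0.997, plus a uniform ("tight") SWA over snapshots taken every 50 steps. A sketch with parameters as flat lists; the exact combination rule is an assumption:

```python
def update_averages(ema, swa, swa_count, params, step,
                    ema_decay=0.997, swa_every=50):
    # EMA: exponential moving average, updated on every step.
    ema = [ema_decay * e + (1 - ema_decay) * p for e, p in zip(ema, params)]
    # SWA: running uniform mean over periodic snapshots.
    if step % swa_every == 0:
        swa_count += 1
        swa = [s + (p - s) / swa_count for s, p in zip(swa, params)]
    return ema, swa, swa_count
```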
Quantization
GPTQ-lite
bits: 6
scope: model weights
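"GPTQ-lite" presumably adds error-compensating updates on top of plain quantization; as a stand-in, here is round-to-nearest symmetric 6-bit quantization over a flat weight list (the per-tensor scaling scheme is an assumption):

```python
def quantize_6bit(weights):
    # Symmetric round-to-nearest quantization to signed 6-bit range.
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequantized = [qi * scale for qi in q]
    return q, dequantized, scale
```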
Compression
lzma
level: null
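The submission names only the algorithm, and `level: null` suggests the library default preset. A minimal sketch with the Python stdlib `lzma` module:

```python
import lzma

def compress_artifact(raw_bytes):
    # Pack serialized (e.g. quantized) weight bytes with LZMA at the
    # default preset; compression here is lossless.
    packed = lzma.compress(raw_bytes)
    assert lzma.decompress(packed) == raw_bytes  # round trip check
    return packed
```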
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
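The TTT parameters describe SGD with momentum and gradient clipping, run for 3 epochs over chunks of the evaluation stream. A generic sketch of that loop; `grad_fn` is a hypothetical stand-in for the model's backward pass, and batching/chunking details are abstracted away:

```python
def ttt_adapt(params, chunks, grad_fn,
              lr=0.002, epochs=3, momentum=0.9, grad_clip=1.0):
    # Test-time training: adapt params on evaluation-stream chunks
    # with clipped SGD + momentum, matching the listed hyperparameters.
    vel = [0.0] * len(params)
    for _ in range(epochs):
        for chunk in chunks:
            g = grad_fn(params, chunk)
            norm = sum(gi * gi for gi in g) ** 0.5
            if norm > grad_clip:  # global-norm gradient clipping
                g = [gi * grad_clip / norm for gi in g]
            vel = [momentum * v + gi for v, gi in zip(vel, g)]
            params = [p - lr * v for p, v in zip(params, vel)]
    return params
```

With `freeze_blocks: 0`, all blocks would be adapted; freezing would correspond to excluding those entries from `params`.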
LR Schedule
warmdown
parameters: {"warmdown_iters":7800,"iterations":20000,"muon_momentum_warmup_steps":3340}
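The listed schedule implies a constant learning rate followed by a "warmdown" over the final 7800 of 20000 iterations. A sketch assuming the common linear-decay-to-zero shape:

```python
def warmdown_lr(step, base_lr=1.0, iterations=20000, warmdown_iters=7800):
    # Constant LR, then linear decay to zero over the last
    # `warmdown_iters` steps (decay shape is an assumption).
    start = iterations - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (iterations - step) / warmdown_iters
```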
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":3340}
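The optimizer entries describe a momentum warmup from 0.92 to the final 0.99 over the first 3340 steps. A sketch assuming a linear ramp (the ramp shape is not stated):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=3340):
    # Ramp Muon's momentum coefficient linearly from `start` to `final`
    # over the first `warmup_steps` steps, then hold it constant.
    if step >= warmup_steps:
        return final
    return start + (final - start) * step / warmup_steps
```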

Novel Contributions

  • Extended the record-track architecture to 50K training steps under unlimited compute.
  • Showed that post-TTT BPB improves from 1.0960 at 20K steps to 1.0858 at 50K steps.
  • Demonstrated non-monotonic artifact-size behavior: mid-training checkpoints exceed 16MB, but warmdown restores compressibility, so the final model fits back under the 16MB budget.
  • Analyzed scaling across 3 seeds and compared the 20K- and 50K-step compute regimes.