PR #1005

open

[Non-Record] Extended Compute Scaling Analysis: 1.0853 BPB at 50K steps (11.5 hours) on 4×A100 MIG

by OnlyJundongView on GitHub
val_bpb: 1.0853
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.35 MB

Training Techniques

Architecture
  • BigramHash: bigram hash embedding component used in the model. parameters: {"size":1536}
  • XSA: attention/sequence architecture component applied to the last layers. parameters: {"last_layers":4}
  • Partial RoPE: rotary positional embeddings applied to a subset of head dimensions. parameters: {"dimensions":16,"total_dimensions":64}
  • MLP3x: 3× MLP expansion with LeakyReLU-squared activation. parameters: null
  • LeakyReLU: LeakyReLU-squared activation used in the MLP. parameters: {"squared":true,"slope":0.5}
  • VE128: value residual enhancement module. parameters: {"layers":[9,10],"dimension":128}
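The Partial RoPE entry rotates only 16 of the 64 head dimensions and leaves the rest position-free. A minimal NumPy sketch of that idea (the rotation base of 10000 and the pairing of the first 16 dimensions are assumptions; the PR does not specify them):

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of each head vector; pass the rest through."""
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,) frequency per pair
    angles = positions[:, None] * inv_freq[None, :]     # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]      # the paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # remaining 64 - rope_dims dimensions carry no positional signal
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```

With {"dimensions":16,"total_dimensions":64}, only a quarter of each head carries positional information.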
Weight Averaging
  • EMA + SWA: parameters: {"ema_decay":0.997,"swa_every":50}
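A hedged sketch of how EMA and SWA averages with these parameters could be maintained side by side (scalar parameters in a dict for brevity; the actual tensors and initialization are not given in the PR):

```python
def update_averages(step, params, ema, swa, swa_count,
                    ema_decay=0.997, swa_every=50):
    """Update an exponential moving average every step, and an equal-weight
    stochastic weight average every `swa_every` steps. Returns new swa_count."""
    for k, v in params.items():
        ema[k] = ema_decay * ema[k] + (1.0 - ema_decay) * v
    if step % swa_every == 0:
        swa_count += 1
        for k, v in params.items():
            swa[k] += (v - swa[k]) / swa_count   # running mean of snapshots
    return swa_count
```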
Quantization
  • GPTQ-lite: bits: 6, scope: all
Compression
  • lzma: level: null
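The 6-bit quantization plus lzma stage can be sketched as below. Plain symmetric round-to-nearest is used as a stand-in, since "GPTQ-lite" is not specified further; storing 6-bit codes in int8 leaves lzma to squeeze out the unused range:

```python
import lzma
import numpy as np

def quantize(w, bits=6):
    """Symmetric round-to-nearest quantization of a float array to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6 bits
    scale = float(np.abs(w).max()) / qmax       # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize(w)
blob = lzma.compress(q.tobytes())               # artifact bytes to ship
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).astype(np.float32) * scale
```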
Evaluation
  • Sliding window eval: parameters: {"stride":64}
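With stride 64, sliding-window evaluation typically scores each token with close to a full context window, re-scoring only the newest 64 tokens per window. A sketch of the span bookkeeping (window length 2048 is an assumed example; the PR only gives the stride):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, score_start, end): tokens in [score_start, end)
    are scored using context [ctx_start, end)."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, 0, end))                    # first window scores everything
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```

Each subsequent window pays a full forward pass to score only `stride` new tokens, which is why small strides are slow but give near-full context per token.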
Test-Time Training
  • Full TTT: parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768}
Regularization
  • LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
LR Schedule
  • Warmdown: parameters: {"warmdown_iters":19500,"muon_momentum_warmup_steps":8350,"max_wallclock_seconds":0}
Optimizer
  • Parallel Muon: weight_decay: 0.04, momentum: 0.99, other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":8350}
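A sketch of the two schedules implied by the parameters above: a warmdown of the learning rate over the final 19,500 iterations of a 50K-step run (the 50K total comes from the title; the linear decay shape is an assumption), and a linear warmup of Muon momentum from 0.92 to 0.99 over 8,350 steps:

```python
def lr_at(step, base_lr, total_iters=50_000, warmdown_iters=19_500):
    """Constant LR, then linear decay to zero over the last warmdown_iters."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

def muon_momentum_at(step, start=0.92, final=0.99, warmup_steps=8_350):
    """Linear momentum warmup, then held at the final value."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (final - start)
```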

Novel Contributions

  • Extended compute scaling analysis of the record-track SOTA beyond the 10-minute wall-clock limit
  • Demonstrated 1.0853 BPB at 50K steps on 4×A100 MIG
  • Showed that artifact size is non-monotonic during training and recovers below 16 MB after warmdown
  • Analyzed diminishing returns in BPB beyond roughly 30K steps
  • Compared 20K- and 50K-step runs against the record-track SOTA and quantified how TTT gains scale with compute