PR #1005
open · [Non-Record] Extended Compute Scaling Analysis: 1.0853 BPB at 50K steps (11.5 hours) on 4×A100 MIG
by OnlyJundongView on GitHub
val_bpb: 1.0853
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.35 MB
Training Techniques
Architecture
BigramHash
Bigram hash embedding component used in the model.
parameters: {"size":1536}
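The listing gives only `size: 1536` for the bigram hash component. A minimal sketch of a bigram hash embedding, assuming `size` is the number of hash buckets (the embedding width `dim` and `vocab_size` below are illustrative, not from the PR):

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Embed each (previous, current) token bigram via a fixed-size hash table."""

    def __init__(self, table_size=1536, dim=64, vocab_size=50304):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        self.table_size = table_size
        self.vocab_size = vocab_size

    def forward(self, idx):
        # idx: (B, T) token ids; pair each token with its predecessor
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # cheap multiplicative hash of the bigram into the table
        h = (prev * self.vocab_size + idx) % self.table_size
        return self.table(h)  # (B, T, dim), typically added to token embeddings
```

The hash is intentionally simple; collisions are expected and absorbed by training.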
XSA
Attention/sequence architecture component applied to the last layers.
parameters: {"last_layers":4}
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"dimensions":16,"total_dimensions":64}
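Partial RoPE rotates only a subset of each head's dimensions (here 16 of 64) and passes the rest through unchanged. A hedged sketch, with the frequency base set to the common 10000 default (not stated in the listing):

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (B, H, T, D). Rotate only the first `rot_dims` dims per head;
    # the remaining D - rot_dims dims are passed through untouched.
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    T = x.shape[-2]
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```

Rotating only a slice keeps some position-free channels available to attention while still encoding relative position.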
MLP3x
3× MLP hidden-dimension expansion with squared LeakyReLU activation.
parameters: null
LeakyReLU
Squared LeakyReLU activation used in the MLP.
parameters: {"squared":true,"slope":0.5}
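The exact form of the squared LeakyReLU isn't spelled out. One plausible reading, analogous to the ReLU² activation used in other speedrun entries, is a sign-preserving square of LeakyReLU with slope 0.5 (plain squaring would make the negative branch positive and discard the leak):

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU followed by a sign-preserving square: y * |y| keeps the
    # negative branch negative, so the activation still "leaks".
    y = F.leaky_relu(x, negative_slope=slope)
    return y * y.abs()
```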
VE128
Value residual enhancement module.
parameters: {"layers":[9,10],"dimension":128}
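VE128's internals aren't shown. A hypothetical stand-in for a value-embedding module, a per-token learned embedding of width 128 mixed into the attention value stream at the listed layers (9 and 10) through a learnable gate:

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    # Hypothetical VE128 stand-in: per-token embedding added to the value
    # projections of selected layers, scaled by a learnable gate.
    def __init__(self, vocab_size=50304, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, v, idx):
        # v: (B, T, dim) value projections; idx: (B, T) token ids
        return v + torch.sigmoid(self.gate) * self.emb(idx)
```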
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
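A sketch of how the two averages could be maintained together, using the listed decay (0.997) and snapshot interval (50 steps); weights are shown as plain floats for brevity:

```python
def update_averages(params, ema, swa, step, ema_decay=0.997, swa_every=50):
    # EMA: updated every step with exponential decay.
    for k, p in params.items():
        ema[k] = ema_decay * ema[k] + (1.0 - ema_decay) * p
    # SWA: equal-weight running mean of snapshots taken every `swa_every` steps.
    if step % swa_every == 0:
        n = swa["n"]
        for k, p in params.items():
            swa["avg"][k] = (swa["avg"][k] * n + p) / (n + 1)
        swa["n"] = n + 1
```

EMA tracks recent weights smoothly, while SWA averages widely spaced snapshots; keeping both lets the final checkpoint pick whichever evaluates better.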
Quantization
GPTQ-lite
bits: 6
scope: all
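"GPTQ-lite" isn't defined here. As a rough stand-in, 6-bit quantization over all weights can be illustrated with per-tensor symmetric round-to-nearest (GPTQ proper additionally does error-compensated, Hessian-aware rounding):

```python
import torch

def quantize_rtn(w, bits=6):
    # Per-tensor symmetric round-to-nearest quantization to `bits` bits.
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale
```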
Compression
lzma
level: null
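The compression `level` is listed as null. A sketch of how the artifact size could be measured: serialize the (quantized) state dict and LZMA-compress it (the preset below is illustrative, not from the PR):

```python
import io
import lzma
import torch

def artifact_size_mb(state_dict, preset=9):
    # Serialize, then LZMA-compress; the artifact size is the compressed length.
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    return len(lzma.compress(buf.getvalue(), preset=preset)) / 2**20
```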
Evaluation
sliding window eval
parameters: {"stride":64}
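With stride 64, each token is scored exactly once while reusing most of the preceding window as context. The index bookkeeping can be sketched as follows (the window length is an assumption; only the stride comes from the listing):

```python
def sliding_window_spans(n_tokens, window=256, stride=64):
    # Each span is (ctx_start, score_from, end): run the model on
    # tokens[ctx_start:end] but only count loss on [score_from, end),
    # so every token is scored exactly once with long left context.
    # Assumes window >= stride.
    spans = []
    scored, start = 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, scored, end))
        scored = end
        start += stride
    return spans
```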
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768}
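One way full test-time training can be run is score-then-adapt per chunk, so no tokens are scored after the model has trained on them; whether this PR scores before or after adapting isn't stated. A toy sketch with the listed learning rate and epoch count (in the PR, chunks are 32,768 tokens and the loss is next-token cross-entropy):

```python
import copy
import torch
import torch.nn as nn

def ttt_eval(model, chunks, loss_fn, lr=0.002, epochs=3):
    # Score each chunk with the current weights, then take `epochs` gradient
    # passes on it before moving on; restore the original weights at the end.
    base = copy.deepcopy(model.state_dict())
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for x, y in chunks:
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    model.load_state_dict(base)  # leave the evaluated model untouched
    return losses
```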
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
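The 1/sqrt(layer+1) gain can be applied by re-initializing each block's LayerNorm weight, damping the normalized output of deeper layers. A sketch over a generic list of blocks (the attribute layout is an assumption):

```python
import math
import torch
import torch.nn as nn

def apply_ln_scale(blocks):
    # Set every LayerNorm gain in block i to 1/sqrt(i + 1), so normalization
    # output is progressively damped in deeper layers.
    for i, block in enumerate(blocks):
        for m in block.modules():
            if isinstance(m, nn.LayerNorm):
                nn.init.constant_(m.weight, 1.0 / math.sqrt(i + 1))
```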
LR Schedule
warmdown
parameters: {"warmdown_iters":19500,"muon_momentum_warmup_steps":8350,"max_wallclock_seconds":0}
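Warmdown holds the learning rate constant and then decays it linearly to zero over the final `warmdown_iters` steps; with 19,500 warmdown iterations in a 50K-step run, decay starts at step 30,500. A sketch of the multiplier:

```python
def lr_mult(step, total_steps=50_000, warmdown_iters=19_500):
    # Constant LR until the warmdown window, then linear decay to zero.
    start = total_steps - warmdown_iters
    if step < start:
        return 1.0
    return (total_steps - step) / warmdown_iters
```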
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":8350}
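The `other_params` entries imply a linear momentum warmup from 0.92 to the final 0.99 over the first 8,350 steps, which can be sketched as:

```python
def muon_momentum(step, warmup_steps=8_350, start=0.92, end=0.99):
    # Ramp momentum linearly from `start` to `end`, then hold it there.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```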
Novel Contributions
- Extended compute scaling analysis of the record-track SOTA beyond the 10-minute wall-clock limit
- Demonstrated 1.0853 BPB at 50K steps on 4×A100 MIG
- Showed that artifact size is non-monotonic during training and recovers to below 16 MB after warmdown
- Analyzed diminishing returns in BPB beyond roughly 30K steps
- Compared 20K and 50K step runs against the record-track SOTA and quantified TTT gains scaling with compute