PR #1005

open

[Non-Record] Extended Compute Scaling Analysis: 1.0853 BPB at 50K steps (11.5 hours) on 4×A100 MIG

by OnlyJundongView on GitHub
val_bpb: 1.0853
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.35 MB

Training Techniques

Architecture
  • BigramHash: bigram hash embedding component used in the model. parameters: {"size":1536}
  • XSA: attention/sequence architecture component applied to the last layers. parameters: {"last_layers":4}
  • Partial RoPE: rotary positional embeddings applied to a subset of head dimensions. parameters: {"dimensions":16,"total_dimensions":64}
  • MLP3x: 3× MLP expansion with LeakyReLU-squared activation. parameters: null
  • LeakyReLU: LeakyReLU-squared activation used in the MLP. parameters: {"squared":true,"slope":0.5}
  • VE128: value residual enhancement module. parameters: {"layers":[9,10],"dimension":128}
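The Partial RoPE entry rotates only 16 of the 64 head dimensions and leaves the rest position-free. A minimal NumPy sketch of that idea (the rotation base of 10000 and the pairing of the first 16 dimensions are assumptions; the PR does not specify them):

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of each head vector; pass the rest through."""
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,) frequency per pair
    angles = positions[:, None] * inv_freq[None, :]     # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]      # the paired halves to rotate
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # remaining 64 - rope_dims dimensions carry no positional signal
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```

With {"dimensions":16,"total_dimensions":64}, only a quarter of each head carries positional information.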
Weight Averaging
  • EMA + SWA: parameters: {"ema_decay":0.997,"swa_every":50}
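A hedged sketch of how EMA and SWA averages with these parameters could be maintained side by side (scalar parameters in a dict for brevity; the actual tensors and initialization are not given in the PR):

```python
def update_averages(step, params, ema, swa, swa_count,
                    ema_decay=0.997, swa_every=50):
    """Update an exponential moving average every step, and an equal-weight
    stochastic weight average every `swa_every` steps. Returns new swa_count."""
    for k, v in params.items():
        ema[k] = ema_decay * ema[k] + (1.0 - ema_decay) * v
    if step % swa_every == 0:
        swa_count += 1
        for k, v in params.items():
            swa[k] += (v - swa[k]) / swa_count   # running mean of snapshots
    return swa_count
```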
Quantization
  • GPTQ-lite: bits: 6, scope: all
Compression
  • lzma: level: null
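The 6-bit quantization plus lzma stage can be sketched as below. Plain symmetric round-to-nearest is used as a stand-in, since "GPTQ-lite" is not specified further; storing 6-bit codes in int8 leaves lzma to squeeze out the unused range:

```python
import lzma
import numpy as np

def quantize(w, bits=6):
    """Symmetric round-to-nearest quantization of a float array to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for 6 bits
    scale = float(np.abs(w).max()) / qmax       # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize(w)
blob = lzma.compress(q.tobytes())               # artifact bytes to ship
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).astype(np.float32) * scale
```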
Evaluation
  • Sliding window eval: parameters: {"stride":64}
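With stride 64, sliding-window evaluation typically scores each token with close to a full context window, re-scoring only the newest 64 tokens per window. A sketch of the span bookkeeping (window length 2048 is an assumed example; the PR only gives the stride):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, score_start, end): tokens in [score_start, end)
    are scored using context [ctx_start, end)."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, 0, end))                    # first window scores everything
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```

Each subsequent window pays a full forward pass to score only `stride` new tokens, which is why small strides are slow but give near-full context per token.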
Test-Time Training
  • Full TTT: parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768}
Regularization
  • LN scale: parameters: {"scale":"1/sqrt(layer+1)"}
LR Schedule
  • Warmdown: parameters: {"warmdown_iters":19500,"muon_momentum_warmup_steps":8350,"max_wallclock_seconds":0}
Optimizer
  • Parallel Muon: weight_decay: 0.04, momentum: 0.99, other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":8350}
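A sketch of the two schedules implied by the parameters above: a warmdown of the learning rate over the final 19,500 iterations of a 50K-step run (the 50K total comes from the title; the linear decay shape is an assumption), and a linear warmup of Muon momentum from 0.92 to 0.99 over 8,350 steps:

```python
def lr_at(step, base_lr, total_iters=50_000, warmdown_iters=19_500):
    """Constant LR, then linear decay to zero over the last warmdown_iters."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

def muon_momentum_at(step, start=0.92, final=0.99, warmup_steps=8_350):
    """Linear momentum warmup, then held at the final value."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (final - start)
```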

Novel Contributions

  • Extended compute scaling analysis of the record-track SOTA beyond the 10-minute wall-clock limit
  • Demonstrated 1.0853 BPB at 50K steps on 4×A100 MIG
  • Showed that artifact size is non-monotonic during training and recovers below 16 MB after warmdown
  • Analyzed diminishing returns in BPB beyond roughly 30K steps
  • Compared 20K- and 50K-step runs against the record-track SOTA and quantified how TTT gains scale with compute