PR #1407

open

Non-record: Extended Compute Scaling Analysis (20K steps, 1.0960 BPB, 3 seeds; each run ~6 hours on 4xA100 MIG)

by OnlyJundongView on GitHub
val_bpb: 1.0960
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.05MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
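A minimal sketch of the "LeakyReLU squared" activation, assuming the usual small negative slope (the PR does not state the slope value):

```python
def leaky_relu_squared(x, negative_slope=0.01):
    # LeakyReLU followed by squaring; the 0.01 slope is an assumption.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

In the MLP this would replace the usual ReLU/GELU between the two projections.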
BigramHash
Bigram hash embedding component.
parameters: {"size":1536}
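A hedged sketch of how a bigram hash embedding can index a table of the stated size; the hash constant here is illustrative, not the PR's actual mixing function:

```python
def bigram_hash_index(prev_token, token, size=1536):
    # Mix the two adjacent token ids into a bucket of the 1536-entry
    # embedding table; the multiplier is a placeholder, not the PR's hash.
    return (prev_token * 1000003 + token) % size
```

The looked-up embedding would then be added alongside the regular token embedding.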
XSA
Applies XSA in the last 4 layers.
parameters: {"layers":4}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
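A sketch of partial rotary embeddings under these parameters: only the first 16 of the 64 head dimensions are rotated, the rest pass through unchanged (the pairing convention and base are assumptions):

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` dims of a 64-dim head vector;
    # dims beyond rot_dims are left untouched.
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        theta = position / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + half]
        out[i] = x0 * c - x1 * s
        out[i + half] = x0 * s + x1 * c
    return out
```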
VE128
Value embedding / residual enhancement module.
parameters: {"layers":[9,10]}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
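The LN scale rule above as a one-liner, assuming 0-indexed layers:

```python
import math

def ln_scale(layer_index):
    # Down-weights later layers' LayerNorm gain: 1/sqrt(layer+1),
    # matching the "scale" parameter (0-indexed layers assumed).
    return 1.0 / math.sqrt(layer_index + 1)
```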
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
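A sketch of how the two averages could be maintained together: EMA updated every step with the stated decay, SWA snapshots taken every 50 steps ("tight" is read here as a frequent snapshot cadence; that interpretation is an assumption):

```python
def update_averages(step, params, ema, swa_sum, swa_count,
                    ema_decay=0.997, swa_every=50):
    # EMA: exponential moving average of weights, updated every step.
    for i, p in enumerate(params):
        ema[i] = ema_decay * ema[i] + (1.0 - ema_decay) * p
    # SWA: plain running mean of periodic snapshots.
    if step % swa_every == 0:
        for i, p in enumerate(params):
            swa_sum[i] += p
        swa_count += 1
    return swa_count

def swa_average(swa_sum, swa_count):
    return [s / swa_count for s in swa_sum]
```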
Quantization
GPTQ-lite
bits: 6
scope: model + code
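For illustration of the 6-bit storage format only: a plain round-to-nearest symmetric quantizer. This is NOT GPTQ-lite (which calibrates rounding with second-order error information); it only shows what 6-bit integer weights plus a scale look like:

```python
def quantize6(weights):
    # Symmetric 6-bit: integers in [-31, 31] plus one float scale.
    qmax = 2 ** 5 - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize6(q, scale):
    return [v * scale for v in q]
```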
Compression
lzma
level: null
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
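A toy skeleton of the full-TTT loop implied by these parameters: the evaluation text is split into chunks of `chunk_tokens`, every block is updated (freeze_blocks=0) for 3 epochs with momentum SGD and gradient clipping. `grads_fn` stands in for a real backward pass and the scalar `param` for the model weights; both are hypothetical simplifications:

```python
def ttt_adapt(param, grads_fn, tokens, learning_rate=0.002, epochs=3,
              chunk_tokens=32768, momentum=0.9, grad_clip=1.0):
    # Single-parameter sketch; a real run would step all unfrozen
    # blocks with batches of `batch_seqs` sequences per chunk.
    velocity = 0.0
    chunks = [tokens[i:i + chunk_tokens]
              for i in range(0, len(tokens), chunk_tokens)]
    for _ in range(epochs):
        for chunk in chunks:
            g = grads_fn(param, chunk)
            g = max(-grad_clip, min(grad_clip, g))  # grad clipping
            velocity = momentum * velocity + g
            param -= learning_rate * velocity
    return param
```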
LR Schedule
warmdown
parameters: {"iterations":20000,"warmdown_iters":7800,"muon_momentum_warmup_steps":3340}
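The warmdown schedule as a multiplier on the base LR: constant, then decaying to zero over the final 7,800 of 20,000 iterations (a linear decay shape is assumed; the PR only names the schedule "warmdown"):

```python
def lr_multiplier(step, iterations=20000, warmdown_iters=7800):
    # Constant LR, then linear decay to 0 over the last warmdown_iters.
    start = iterations - warmdown_iters
    if step < start:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)
```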
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":3340,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
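The momentum warmup implied by these settings: ramp from 0.92 to the final 0.99 over the first 3,340 steps (linear interpolation is an assumption):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=3340):
    # Linear ramp of Muon's momentum during early training.
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```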

Novel Contributions

  • Extended the PR #549 method to 20K training steps under unlimited compute.
  • Showed that post-TTT BPB improves to 1.0960 on a 3-seed mean.
  • Demonstrated non-monotonic artifact size behavior: mid-training checkpoints exceed 16MB, but the final warmdown-completed model fits at about 15.05MB.
  • Analyzed scaling behavior across training steps, showing rapid early BPB gains followed by a warmdown-driven final drop.
  • Confirmed that TTT gains scale with compute, with TTT reducing BPB by 0.006 at 20K steps.