PR #1424

open

Non-record: Extended Compute Scaling Analysis (50K steps, 1.0858 BPB, 3 seeds; each run ~12 hours on 4xA100 MIG)

by OnlyJundongView on GitHub
val_bpb: 1.0858
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.30 MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
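The squared-LeakyReLU activation has a simple pointwise form. A minimal sketch; the negative slope value is an assumption, since the submission does not state it:

```python
def leaky_relu_squared(x, negative_slope=0.01):
    # Square of LeakyReLU: negative inputs are scaled by the slope
    # before squaring, so outputs are always non-negative.
    y = x if x >= 0 else negative_slope * x
    return y * y
```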
BigramHash
Bigram hash embedding component.
parameters: {"size":1536}
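A bigram hash embedding maps each consecutive token pair to a row of a fixed-size table (1536 here). The mixing constants and BOS placeholder below are illustrative, not taken from the submission:

```python
def bigram_hash_ids(tokens, table_size=1536, bos=0):
    # Hash each (previous token, current token) pair into an index
    # of a shared embedding table of `table_size` rows.
    ids = []
    prev = bos
    for tok in tokens:
        ids.append(((prev * 1000003) ^ tok) % table_size)
        prev = tok
    return ids
```

The resulting ids would index an embedding matrix whose rows are added to (or concatenated with) the token embeddings.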
XSA
Applies XSA to the last 4 layers.
parameters: {"last_n_layers":4}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
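Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the rest through unchanged. A sketch for a single head vector; the half-split pairing convention and the base of 10000 are assumptions:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # Rotate only the first `rot_dims` entries of the head dimension;
    # the remaining dimensions pass through unchanged.
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, x[rot_dims:]])
```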
VE128
Value embedding / residual component with dimension 128.
parameters: {"dimension":128,"layers":[9,10]}
Regularization
LN scale
LayerNorm gain scaled down with depth.
parameters: {"scale":"1/sqrt(layer+1)"}
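The listed 1/sqrt(layer+1) rule is direct to compute; whether it is applied as an initialization or a fixed multiplier is not stated, so this sketch only produces the factor:

```python
import math

def ln_scale(layer_index):
    # Depth-dependent LayerNorm gain: layer i gets 1/sqrt(i+1),
    # damping deeper layers' contribution to the residual stream.
    return 1.0 / math.sqrt(layer_index + 1)
```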
Weight Averaging
EMA + Tight SWA
Exponential moving average combined with short-interval stochastic weight averaging.
parameters: {"ema_decay":0.997,"swa_every":50}
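One plausible reading of the listed parameters: an EMA updated every step with decay 0.997, plus a uniform ("tight") SWA over snapshots taken every 50 steps. A sketch with parameters as flat lists; the exact combination rule is an assumption:

```python
def update_averages(ema, swa, swa_count, params, step,
                    ema_decay=0.997, swa_every=50):
    # EMA: exponential moving average, updated on every step.
    ema = [ema_decay * e + (1 - ema_decay) * p for e, p in zip(ema, params)]
    # SWA: running uniform mean over periodic snapshots.
    if step % swa_every == 0:
        swa_count += 1
        swa = [s + (p - s) / swa_count for s, p in zip(swa, params)]
    return ema, swa, swa_count
```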
Quantization
GPTQ-lite
bits: 6
scope: model weights
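"GPTQ-lite" presumably adds error-compensating updates on top of plain quantization; as a stand-in, here is round-to-nearest symmetric 6-bit quantization over a flat weight list (the per-tensor scaling scheme is an assumption):

```python
def quantize_6bit(weights):
    # Symmetric round-to-nearest quantization to signed 6-bit range.
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequantized = [qi * scale for qi in q]
    return q, dequantized, scale
```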
Compression
lzma
level: null
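The submission names only the algorithm, and `level: null` suggests the library default preset. A minimal sketch with the Python stdlib `lzma` module:

```python
import lzma

def compress_artifact(raw_bytes):
    # Pack serialized (e.g. quantized) weight bytes with LZMA at the
    # default preset; compression here is lossless.
    packed = lzma.compress(raw_bytes)
    assert lzma.decompress(packed) == raw_bytes  # round trip check
    return packed
```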
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
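The TTT parameters describe SGD with momentum and gradient clipping, run for 3 epochs over chunks of the evaluation stream. A generic sketch of that loop; `grad_fn` is a hypothetical stand-in for the model's backward pass, and batching/chunking details are abstracted away:

```python
def ttt_adapt(params, chunks, grad_fn,
              lr=0.002, epochs=3, momentum=0.9, grad_clip=1.0):
    # Test-time training: adapt params on evaluation-stream chunks
    # with clipped SGD + momentum, matching the listed hyperparameters.
    vel = [0.0] * len(params)
    for _ in range(epochs):
        for chunk in chunks:
            g = grad_fn(params, chunk)
            norm = sum(gi * gi for gi in g) ** 0.5
            if norm > grad_clip:  # global-norm gradient clipping
                g = [gi * grad_clip / norm for gi in g]
            vel = [momentum * v + gi for v, gi in zip(vel, g)]
            params = [p - lr * v for p, v in zip(params, vel)]
    return params
```

With `freeze_blocks: 0`, all blocks would be adapted; freezing would correspond to excluding those entries from `params`.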
LR Schedule
warmdown
parameters: {"warmdown_iters":7800,"iterations":20000,"muon_momentum_warmup_steps":3340}
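The listed schedule implies a constant learning rate followed by a "warmdown" over the final 7800 of 20000 iterations. A sketch assuming the common linear-decay-to-zero shape:

```python
def warmdown_lr(step, base_lr=1.0, iterations=20000, warmdown_iters=7800):
    # Constant LR, then linear decay to zero over the last
    # `warmdown_iters` steps (decay shape is an assumption).
    start = iterations - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (iterations - step) / warmdown_iters
```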
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":3340}
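The optimizer entries describe a momentum warmup from 0.92 to the final 0.99 over the first 3340 steps. A sketch assuming a linear ramp (the ramp shape is not stated):

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=3340):
    # Ramp Muon's momentum coefficient linearly from `start` to `final`
    # over the first `warmup_steps` steps, then hold it constant.
    if step >= warmup_steps:
        return final
    return start + (final - start) * step / warmup_steps
```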

Novel Contributions

  • Extended the record-track architecture to 50K training steps under unlimited compute.
  • Showed that post-TTT BPB improves from 1.0960 at 20K steps to 1.0858 at 50K steps.
  • Demonstrated non-monotonic artifact-size behavior: mid-training checkpoints exceed 16MB, but warmdown restores compressibility, so the final model fits back under the 16MB budget.
  • Analyzed scaling across 3 seeds and compared the 20K- and 50K-step compute regimes.