PR #473

closed

Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean)

by abaybektursun

val_bpb

1.1214

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

~16.0 MB

Training Techniques

Quantization

GPTQ-lite

bits: 6

scope: model weights
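
To give a feel for what 6-bit weight storage implies, the sketch below applies plain symmetric round-to-nearest quantization to one row of weights. It is not GPTQ-lite itself, since this summary does not describe that method's details; `quantize_row` and `dequantize_row` are hypothetical helper names, and per-row scaling is an assumption.

```python
def quantize_row(row, bits=6):
    """Symmetric round-to-nearest quantization of one weight row (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                    # 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight becomes a small signed integer code in [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


codes, scale = quantize_row([0.5, -0.25, 0.1, 0.0])
approx = dequantize_row(codes, scale)
```

With this scheme the reconstruction error per weight is at most half a quantization step (`scale / 2`).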

Architecture

XSA

Applies XSA to the last 4 layers

parameters: {"layers":4}

Partial RoPE

Uses partial rotary positional embeddings

parameters: {"dimensions":16,"base":64}

SmearGate

Adds SmearGate to the model

parameters: null

BigramHash

Uses a larger BigramHash vocabulary

parameters: {"vocab_size":3072}
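
A bigram hash embedding typically maps each (previous token, current token) pair to one of a fixed number of buckets, here 3072. The sketch below shows that mapping only; the multiplier `1_000_003`, the placeholder bucket for the first position, and the function names are assumptions, not the hash used in the PR.

```python
BIGRAM_VOCAB_SIZE = 3072  # from the parameters above


def bigram_bucket(prev_token, token, vocab_size=BIGRAM_VOCAB_SIZE):
    """Hash a token pair into a bucket index (hypothetical multiplicative hash)."""
    return (prev_token * 1_000_003 + token) % vocab_size


def bigram_ids(tokens, vocab_size=BIGRAM_VOCAB_SIZE):
    """Return one bucket index per position, suitable for an embedding lookup."""
    if not tokens:
        return []
    ids = [0]  # the first token has no predecessor; bucket 0 is an assumed placeholder
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, vocab_size))
    return ids
```

A larger bucket count lowers the chance that two unrelated bigrams share an embedding row, at the cost of a larger table.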

VE

Enables VE on selected layers

parameters: {"dimensions":128,"layers":[9,10]}

MLP3x

Uses a 3x MLP with relu² activation

parameters: {"multiplier":3}
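
The MLP entry above can be read as: project to three times the model width, apply relu² (ReLU followed by squaring), then project back. A dependency-free sketch, assuming no biases and using hypothetical names `w_up` and `w_down`:

```python
def relu_squared(x):
    """relu²: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def mlp_forward(x, w_up, w_down):
    """w_up has shape (3*d, d) and w_down has shape (d, 3*d) for model width d."""
    hidden = [relu_squared(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)


# Toy example with d = 2, so the hidden width is 3 * 2 = 6.
w_up = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [-1, 0]]
w_down = [[1, 1, 1, 1, 1, 1], [0, 0, 0, 0.5, 0, 0]]
out = mlp_forward([1, -1], w_up, w_down)
```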

Optimizer

Parallel Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
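
The momentum warmup parameters above suggest a ramp from 0.92 to the final momentum of 0.99 over the first 1500 steps. The sketch below assumes the ramp is linear, which this summary does not state; `muon_momentum` is a hypothetical helper name.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given training step, assuming a linear warmup."""
    frac = min(step / warmup_steps, 1.0)   # 0.0 at step 0, 1.0 once warmup is done
    return (1 - frac) * start + frac * end


schedule = [muon_momentum(s) for s in (0, 750, 1500, 5000)]
```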

SGD

weight_decay: null

momentum: 0.9

other_params: {"used_for":"TTT adaptation","learning_rate":0.002,"epochs":3,"gradient_clip":1,"batch_size":32}

Weight Averaging

EMA

parameters: {"decay":0.997}
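
An exponential moving average with decay 0.997 keeps a shadow copy of the weights that moves 0.3% of the way toward the live weights at each update, which corresponds to an averaging horizon of roughly 1 / (1 - 0.997) ≈ 333 updates. A minimal sketch over a flat list of parameters (how often the PR applies the update is not stated here):

```python
def ema_update(ema_params, params, decay=0.997):
    """Return the new shadow weights after one EMA update."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]


ema = [1.0, -2.0]
live = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, live)
```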

SWA

parameters: {"frequency":50}

Compression

lzma

level: null
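
Python's standard `lzma` module is enough to illustrate the compression step. Because the level is recorded as null, the sketch uses the library's default preset, which is an assumption about the submission; `compress_artifact` and `decompress_artifact` are hypothetical names.

```python
import lzma


def compress_artifact(raw: bytes) -> bytes:
    """Compress serialized weights with lzma's default preset."""
    return lzma.compress(raw)


def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original bytes."""
    return lzma.decompress(blob)


raw = bytes(range(64)) * 1000          # stand-in for serialized, quantized weights
blob = compress_artifact(raw)
restored = decompress_artifact(blob)
```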

Evaluation

sliding window eval

parameters: {"stride":64}
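
In a sliding window evaluation, the context window advances by the stride and each window scores only tokens that no earlier window has scored, so every token is counted once with as much left context as the window allows. The sketch below computes the window boundaries only; the `window` length and the function name are assumptions for illustration.

```python
def sliding_windows(n_tokens, window, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens [start, end) and is scored on [score_from, end).
    """
    assert 0 < stride <= window, "stride must not exceed the window length"
    out = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        out.append((start, end, scored_upto))
        scored_upto = end      # everything up to `end` has now been scored
        start += stride
    return out


windows = sliding_windows(300, window=128, stride=64)
```

After the first window, each step scores `stride` new tokens, so a smaller stride gives each scored token more context but requires more forward passes.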

Test-Time Training

score-first TTT

parameters: {"chunk_size":32768,"epochs":3,"learning_rate":0.002,"optimizer":"SGD + momentum","freeze_blocks":0,"gradient_clip":1,"batch_size":32}
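
The "score-first" ordering can be sketched as a loop in which each chunk is scored before the model is allowed to adapt on it, so adaptation only ever uses tokens whose loss has already been recorded. `score_fn` and `adapt_fn` are hypothetical callables standing in for the real forward pass and the SGD epochs; this is an outline of the control flow, not the PR's implementation.

```python
def score_first_ttt(model, tokens, score_fn, adapt_fn, chunk_size=32768, epochs=3):
    """Return (mean loss per token, adapted model).

    score_fn(model, chunk) -> summed loss over the chunk (assumed interface)
    adapt_fn(model, chunk) -> model after one epoch on the chunk (assumed interface)
    """
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score first: the chunk is evaluated with weights that have not seen it.
        total_loss += score_fn(model, chunk)
        total_tokens += len(chunk)
        # 2) Only then adapt on the chunk that was just scored.
        for _ in range(epochs):
            model = adapt_fn(model, chunk)
    return total_loss / max(total_tokens, 1), model
```

Reversing steps 1 and 2 would let the model train on tokens before they are scored, which is the leakage the ordering is meant to rule out.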

Sequence Length

sequence_length

train_length: null

eval_length: 32768

LR Schedule

cosine decay

parameters: {"across_chunks":true}
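
A cosine schedule "across chunks" presumably means the learning rate is a function of the chunk index, not of the step within a chunk. The sketch below assumes the rate starts at the listed 0.002 and reaches zero on the final chunk; the exact endpoints and the name `chunk_lr` are assumptions.

```python
import math


def chunk_lr(chunk_index, num_chunks, base_lr=0.002):
    """Cosine-decayed learning rate for a given chunk (0-based index)."""
    if num_chunks <= 1:
        return base_lr
    progress = chunk_index / (num_chunks - 1)      # 0.0 on the first chunk, 1.0 on the last
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


rates = [chunk_lr(i, 10) for i in range(10)]
```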

Regularization

weight decay

parameters: {"muon_wd":0.04,"adam_wd":0.04}

Other

other

Parameter Banking: contiguous 3D banks replace 66 nn.Linear weights. Parallel Muon communication strategy: reduce-scatter, local Newton-Schulz (NS) orthogonalization, then all-gather.

parameters: {"banks":4,"replaced_linear_layers":66}
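
The communication pattern named above can be simulated in a single process: gradients for the matrices in one bank are summed across ranks with each rank keeping only its own shard (reduce-scatter), each rank orthogonalizes its shard locally, and the results are concatenated on every rank (all-gather). The sketch uses the classical cubic Newton-Schulz iteration for simplicity, which is a substitution and not necessarily the variant the PR runs; momentum, learning rate, and real collectives are omitted, and all function names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add_mats(mats):
    """Elementwise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]


def newton_schulz(g, steps=10):
    """Cubic Newton-Schulz, X <- 1.5 X - 0.5 X X^T X, pushing singular values toward 1."""
    norm = sum(v * v for row in g for v in row) ** 0.5 or 1.0
    x = [[v / norm for v in row] for row in g]      # Frobenius-normalize first
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def parallel_muon_step(per_rank_grads):
    """per_rank_grads[r][i] is rank r's gradient for matrix i of one bank."""
    world = len(per_rank_grads)
    n = len(per_rank_grads[0])
    assert n % world == 0, "bank size must divide evenly across ranks"
    shard = n // world
    # Reduce-scatter: rank r ends up with summed gradients for its shard only.
    shards = [
        [add_mats([per_rank_grads[k][i] for k in range(world)])
         for i in range(r * shard, (r + 1) * shard)]
        for r in range(world)
    ]
    # Local NS: each rank orthogonalizes only the matrices it owns.
    updates = [[newton_schulz(g) for g in owned] for owned in shards]
    # All-gather: every rank receives the full list of updates.
    return [u for owned in updates for u in owned]
```

Storing same-shaped weights in one contiguous bank is what makes the even split across ranks straightforward, since each rank's shard is a contiguous slice.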

Novel Contributions

Legal backward-looking score-first TTT framework
Parallel Muon optimizer with Parameter Banking
Increased BigramHash vocabulary size from 2048 to 3072
Reduced TTT freeze depth from 2 to 0
3-seed mean record submission with val_bpb 1.1214