PR #473
closed
Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean)
by abaybektursun
val_bpb
1.1214
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~16.0 MB
Training Techniques
Quantization
GPTQ-lite
bits: 6
scope: model weights
Architecture
XSA
Applies XSA to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Uses partial rotary positional embeddings
parameters: {"dimensions":16,"base":64}
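A minimal sketch of partial rotary embeddings, assuming "dimensions" is the number of leading head dimensions that get rotated and "base" is the frequency base. The function name and the pairing of entry i with entry i + 8 are illustrative assumptions, not taken from the PR.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=64.0):
    """Rotate only the first rot_dims entries of vec; leave the rest untouched."""
    assert rot_dims % 2 == 0 and rot_dims <= len(vec)
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        # Assumed frequency schedule: base ** (-2i / rot_dims)
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + half]  # assumed pairing convention
        out[i] = x * c - y * s
        out[i + half] = x * s + y * c
    return out
```

Because each pair is rotated, the vector norm is preserved, and every entry past the first 16 carries no positional signal.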
SmearGate
Adds SmearGate to the model
parameters: null
BigramHash
Uses a larger BigramHash vocabulary
parameters: {"vocab_size":3072}
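The listing does not say how bigrams are hashed, so the following is only a sketch of the general idea: map each (previous token, current token) pair to one of 3072 buckets that index an extra embedding table. The function name and the multiplier are hypothetical.

```python
def bigram_bucket(prev_token, token, vocab_size=3072, multiplier=1000003):
    """Map a token pair to a bucket index in [0, vocab_size)."""
    # Hypothetical multiplicative hash; the PR's actual hash is not given here.
    return (prev_token * multiplier + token) % vocab_size
```

A larger table (3072 instead of 2048 buckets) means fewer distinct bigrams collide in the same bucket, at the cost of more embedding parameters.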
VE
Enables VE on selected layers
parameters: {"dimensions":128,"layers":[9,10]}
MLP3x
Uses a 3x MLP with relu² activation
parameters: {"multiplier":3}
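A minimal pure-Python sketch of a 3x MLP with relu² activation, assuming a plain up-projection to 3x the model width followed by a down-projection with no biases. The function names and the weight layout (one list per output row) are illustrative assumptions.

```python
def relu_squared(x):
    """relu² activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2

def mlp3x(x, w_up, w_down):
    """x: d inputs; w_up: 3d rows of length d; w_down: d rows of length 3d."""
    hidden = [relu_squared(sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]
```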
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
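The momentum warmup parameters above can be read as a ramp from 0.92 to the final momentum of 0.99 over the first 1500 steps. The sketch below assumes the ramp is linear, which the listing does not state; the function name is hypothetical.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly interpolate momentum from start to end, then hold at end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```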
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for":"TTT adaptation","learning_rate":0.002,"epochs":3,"gradient_clip":1,"batch_size":32}
Weight Averaging
EMA
parameters: {"decay":0.997}
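For reference, the standard EMA update with decay 0.997 keeps 99.7% of the running average and mixes in 0.3% of the current weights at each update. This is a generic sketch over a flat list of floats, not the PR's code.

```python
def ema_update(ema, weights, decay=0.997):
    """Return the new exponential moving average of the weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```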
SWA
parameters: {"frequency":50}
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
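A sketch of how sliding-window evaluation with stride 64 can be indexed, assuming each window advances by the stride and only tokens not yet scored are counted. The function name, the window length argument, and the triple layout are hypothetical.

```python
def sliding_windows(n_tokens, window, stride=64):
    """Return (ctx_start, ctx_end, score_start) triples.

    Tokens in [score_start, ctx_end) are scored using context from ctx_start.
    """
    spans = []
    scored_upto = 0
    end = min(window, n_tokens)
    while True:
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return spans
```

Every token is scored exactly once, and after the first window each scored token sees at least window - stride tokens of context.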
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"epochs":3,"learning_rate":0.002,"optimizer":"SGD + momentum","freeze_blocks":0,"gradient_clip":1,"batch_size":32}
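The score-first ordering can be sketched as a loop in which each chunk is scored with the current weights before any gradient step touches it, so adaptation only ever helps later chunks. The callbacks `score_fn` and `adapt_fn` are hypothetical stand-ins for the model's loss evaluation and one SGD epoch.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs=3):
    """Return the mean per-token loss under score-first test-time training.

    score_fn(chunk) -> summed loss over the chunk (no weight updates).
    adapt_fn(chunk) -> runs one training epoch on the chunk (updates weights).
    """
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score first: the chunk is still unseen by the optimizer.
        total_loss += score_fn(chunk)
        total_tokens += len(chunk)
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs):
            adapt_fn(chunk)
    return total_loss / max(total_tokens, 1)
```

In this sketch the adaptation after the final chunk does not affect the reported score, so it could be skipped.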
Sequence Length
sequence_length
train_length: null
eval_length: 32768
LR Schedule
cosine decay
parameters: {"across_chunks":true}
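One plausible reading of "across_chunks" is that the TTT learning rate follows a cosine curve indexed by chunk number rather than by step within a chunk. The sketch below assumes that reading, a peak of 0.002, and decay to zero at the last chunk; none of these details are confirmed by the listing.

```python
import math

def chunk_lr(chunk_idx, n_chunks, base_lr=0.002):
    """Cosine-decayed learning rate for the given chunk index."""
    progress = chunk_idx / max(n_chunks - 1, 1)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```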
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Parameter Banking: contiguous 3D banks replace 66 nn.Linear weights. Parallel Muon communication strategy: reduce-scatter, then local NS (Newton-Schulz), then all-gather.
parameters: {"banks":4,"replaced_linear_layers":66}
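Assuming "NS" refers to the Newton-Schulz orthogonalization used by the public Muon optimizer, the local step each rank would run on its shard (between the reduce-scatter and the all-gather) can be sketched in pure Python as below. The quintic coefficients are the ones from the public Muon reference implementation; the helper names are hypothetical and the collectives themselves are not shown.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = len(g) > len(g[0])  # work on the wide orientation
    x = transpose(g) if transposed else [list(r) for r in g]
    # Normalize by the Frobenius norm so all singular values are <= 1.
    norm = sum(v * v for row in x for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))   # A = X X^T
        xxt2 = matmul(xxt, xxt)         # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(xxt, xxt2)]  # B = bA + cA^2
        bx = matmul(poly, x)
        x = [[a * v + w for v, w in zip(r1, r2)]
             for r1, r2 in zip(x, bx)]  # X <- aX + BX
    return transpose(x) if transposed else x
```

The iteration pushes singular values toward a band around 1 rather than to exactly 1, which is the intended behavior of this coefficient choice.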
Novel Contributions
- Legal backward-looking score-first TTT framework
- Parallel Muon optimizer with Parameter Banking
- Increased BigramHash vocabulary size from 2048 to 3072
- Reduced TTT freeze depth from 2 to 0
- 3-seed mean record submission with val_bpb 1.1214