PR #1128

open

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon | val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM

by AnubhavBharadwaaj
val_bpb
1.1154
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.9 MB

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
Other
SLOT
Sample-specific LM optimization at test time: optimizes a per-batch additive delta on the last hidden layer during evaluation
parameters: {"delta_shape":[1,1,512],"steps":5,"learning_rate":0.003}
Architecture
LeakyReLU
LeakyReLU squared MLP activation used in the model
parameters: {"mlp_layers":3}
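
As a rough illustration of the LeakyReLU squared activation named above, the sketch below applies a leaky ReLU and then squares the result. The function name and the negative_slope default are assumptions for illustration; the PR summary does not state the slope or the exact formulation used.

```python
def leaky_relu_squared(x, negative_slope=0.01):
    """Leaky ReLU followed by squaring (scalar version).

    negative_slope is a hypothetical default, not a value from the PR.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y


print(leaky_relu_squared(2.0))        # positive branch: 2.0 squared
print(leaky_relu_squared(-2.0, 0.5))  # negative branch: (0.5 * -2.0) squared
```

Note that squaring discards the sign of the negative branch, so negative inputs still produce small positive outputs in this formulation.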
BigramHash
Bigram hash embedding component
parameters: {"vocab_size":1536}
XSA
XSA attention-related modification
parameters: {"last_n":4}
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
VE128
Value residual enhancement module
parameters: {"dim":128,"layers":[9,10]}
Regularization
LN scale
parameters: null
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997,"swa_every":50}
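
The EMA half of this entry can be sketched with the standard exponential-moving-average update, using the decay of 0.997 listed above. The function name and the flat list-of-floats weight representation are assumptions for illustration; how the PR combines EMA with SWA (every 50 steps) is not shown here.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [1.0, 2.0]
ema = ema_update(ema, [0.0, 2.0])
print(ema)
```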
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
lzma
level: 6
Evaluation
stride-based eval
parameters: {"stride":64}
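
A common way to implement stride-based evaluation is to slide a fixed-length context window forward by the stride and score only the tokens that no earlier window has scored. The sketch below computes such windows for a stride of 64; the function name and the seq_len value are assumptions, since the summary gives only the stride.

```python
def sliding_windows(n_tokens, seq_len, stride=64):
    """Return (start, end, score_from) triples.

    Each window feeds tokens [start, end) to the model but only
    tokens [score_from, end) count toward the metric, so every
    token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == n_tokens:
            break
        start += stride
    return windows


# seq_len=128 is a hypothetical context length chosen for the example.
for window in sliding_windows(n_tokens=200, seq_len=128, stride=64):
    print(window)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.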
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"eps":0.00001}
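
The Muon entry lists a momentum of 0.99 with warmup_start 0.92 over 1500 warmup steps. One plausible reading is a linear ramp from 0.92 to 0.99, sketched below; the linear shape and the function name are assumptions, not details confirmed by the summary.

```python
def muon_momentum(step, warmup_start=0.92, final=0.99, warmup_steps=1500):
    """Linearly interpolate momentum from warmup_start to final."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * warmup_start + frac * final


for step in (0, 750, 1500, 5000):
    print(step, muon_momentum(step))
```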
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
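
A warmdown schedule typically holds the learning rate constant and then decays it over the final steps of training. The sketch below assumes a linear decay to zero over the last 3500 steps; the decay shape, the function name, and the total_steps value are assumptions, since the summary lists only warmdown_steps.

```python
def warmdown_multiplier(step, total_steps, warmdown_steps=3500):
    """Learning-rate multiplier: 1.0 until warmdown begins, then linear to 0."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


# total_steps=10000 is a hypothetical run length chosen for the example.
for step in (0, 6500, 8250, 10000):
    print(step, warmdown_multiplier(step, total_steps=10000))
```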

Novel Contributions

  • First SLOT-based entry in Parameter Golf
  • Per-batch test-time optimization of a 512-dimensional delta at the last hidden layer
  • Combination of SLOT with legal score-first TTT
  • Parallel Muon training on top of the existing base architecture from PR #549
  • Record-setting 3-seed mean val_bpb of 1.1154
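
The SLOT idea in the list above can be sketched as follows: the model weights stay frozen, and a single delta of shape [1, 1, D] is added to the last hidden states and tuned for a few gradient steps per batch. This is a minimal NumPy sketch under stated assumptions: it uses plain gradient descent (the PR's optimizer is not stated in the summary), a bare linear output head, and hypothetical names (slot_delta, cross_entropy, hidden, head). Which tokens contribute to the loss in the actual entry is also not specified here.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean next-token cross-entropy for logits (B, T, V) and targets (B, T)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    n = targets.size
    picked = log_probs.reshape(n, -1)[np.arange(n), targets.reshape(n)]
    return -picked.mean()


def slot_delta(hidden, head, targets, steps=5, lr=0.003):
    """Optimize an additive delta of shape (1, 1, D) on frozen hidden states.

    hidden: (B, T, D) last-layer hidden states, treated as constants.
    head:   (D, V) frozen output projection.
    Plain gradient descent is an assumption made for this sketch.
    """
    B, T, D = hidden.shape
    n = B * T
    delta = np.zeros((1, 1, D))
    for _ in range(steps):
        logits = (hidden + delta) @ head
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        # Gradient of mean cross-entropy w.r.t. logits: softmax minus one-hot.
        grad_logits = probs.reshape(n, -1)
        grad_logits[np.arange(n), targets.reshape(n)] -= 1.0
        # The delta is shared by every position, so sum its gradient over them.
        grad_delta = (grad_logits @ head.T).sum(axis=0) / n
        delta -= lr * grad_delta.reshape(1, 1, D)
    return delta


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 8, 16))      # toy sizes; the entry uses D = 512
head = 0.1 * rng.normal(size=(16, 32))
targets = rng.integers(0, 32, size=(2, 8))

delta = slot_delta(hidden, head, targets)
print(cross_entropy(hidden @ head, targets))
print(cross_entropy((hidden + delta) @ head, targets))
```

Because only the delta is trained, each batch costs a handful of small gradient steps on top of the forward pass rather than a full backward pass through the model.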