PR #977

open

LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean)

by michaelwinczuk
val_bpb: 1.1185
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.96 MB

Training Techniques

Architecture
LeakyReLU
Raised the negative slope of the MLP's LeakyReLU activation from 0.5 to 0.75; the activation is squared.
parameters: {"negative_slope":0.75}
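As a rough illustration of the activation described above, the scalar sketch below applies LeakyReLU with negative_slope=0.75 and then squares the result. The order of operations (leaky first, then square) is an assumption, since the PR summary only says "with squared activation", and the function name `leaky_relu_squared` is hypothetical.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    """Scalar sketch of LeakyReLU(negative_slope) followed by squaring.

    Assumption: the square is applied to the LeakyReLU output.
    """
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring makes the output non-negative on both sides of zero.
    return y * y


# Compare the previous default slope (0.5) with the swept value (0.75).
for slope in (0.5, 0.75):
    print(slope, [leaky_relu_squared(x, slope) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)])
```

Under this reading, a larger negative slope lets negative pre-activations contribute more: at x = -2 the output grows from (0.5 · 2)² = 1 to (0.75 · 2)² = 2.25.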
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.027}
Test-Time Training
full TTT
parameters: {"legal":true}
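The PR summary does not define "legal". One common reading, assumed here, is that each chunk of validation data is scored before the model adapts on it, so no token is ever predicted by a model that has already trained on that token. The toy sketch below shows that ordering with a Laplace-smoothed unigram counter standing in for the real model; `score_then_adapt` and the unigram "model" are hypothetical and are not the PR's implementation.

```python
import math


def score_then_adapt(chunks, vocab_size):
    """Mean negative log-likelihood (nats/token) under score-first adaptation.

    Toy stand-in: a Laplace-smoothed unigram model replaces the Transformer,
    and incrementing counts replaces the gradient step.
    """
    counts = [1] * vocab_size
    total_nll = 0.0
    n_scored = 0
    for chunk in chunks:
        total = sum(counts)
        # 1) Score the chunk with the model as it stands, before adapting on it.
        for tok in chunk:
            total_nll -= math.log(counts[tok] / total)
            n_scored += 1
        # 2) Only then adapt on the tokens that were just scored.
        for tok in chunk:
            counts[tok] += 1
    if n_scored == 0:
        raise ValueError("no tokens to score")
    return total_nll / n_scored


adaptive = score_then_adapt([[0, 0], [0, 0]], vocab_size=2)
static = math.log(2)  # loss of the same model with no adaptation
print(adaptive < static)
```

The first chunk is scored by the unadapted model, so any gain comes only from later chunks, which is the property the "legal" label is assumed to guarantee.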
LR Schedule
warmdown
parameters: {"warmdown_iters":3700}
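A warmdown schedule of this kind can be sketched as a learning-rate multiplier that stays at 1.0 and then decays over the final warmdown_iters steps. The linear decay to zero, the function name `lr_scale`, and the total step count used in the example are assumptions for illustration; only warmdown_iters=3700 comes from the PR.

```python
def lr_scale(step: int, total_iters: int, warmdown_iters: int = 3700) -> float:
    """Learning-rate multiplier: flat at 1.0, then linear decay to 0.

    Assumption: linear warmdown over the last `warmdown_iters` steps.
    """
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0
    # Fraction of the warmdown phase that is still remaining.
    return max(0.0, (total_iters - step) / warmdown_iters)


total = 10_000  # hypothetical run length, not taken from the PR
for step in (0, 6_300, 8_150, 10_000):
    print(step, lr_scale(step, total))
```

With a fixed run length, extending the warmdown means the decay starts earlier and the model spends more of the run at reduced learning rates.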
Evaluation
sliding window eval
parameters: null
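Since no parameters are listed for the sliding window eval, the sketch below only illustrates the general idea: overlapping context windows advance by a fixed stride, and each token is scored exactly once with as much left context as the window allows. The window and stride values, and the helper names `sliding_windows` and `bits_per_byte`, are hypothetical. The second helper shows the standard conversion from summed negative log-likelihood in nats to bits per byte, the metric reported as val_bpb.

```python
import math


def sliding_windows(n_tokens: int, window: int, stride: int):
    """Return (start, end, score_from) spans covering tokens [0, n_tokens).

    The model sees tokens [start, end) but only tokens [score_from, end)
    count toward the loss, so every token is scored exactly once.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    spans = []
    scored_upto = 0
    while scored_upto < n_tokens:
        end = min(n_tokens, max(window, scored_upto + stride))
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
    return spans


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


print(sliding_windows(10, window=4, stride=2))
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is the usual trade-off in this style of evaluation.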

Novel Contributions

  • Swept the LeakyReLU negative_slope and found that 0.75 outperforms the SOTA default of 0.5
  • Minor learning-rate tuning: MATRIX_LR set to 0.027
  • Extended the warmdown schedule to 3700 iterations
  • Legal test-time training, with val_bpb reported as a 3-seed mean
  • Parallel Muon optimizer setup