PR #1217

open

Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)

val_bpb: 1.1027
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.80 MB

Training Techniques

Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"MuonEq-R":true,"row_normalization_before_newton_schulz":true}
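
A minimal sketch of what "row normalization before Newton-Schulz" could look like, not the PR's actual code. Each row of the update matrix is scaled to unit L2 norm, then the result is orthogonalized. Assumptions: the helper names (`row_normalize`, `newton_schulz`, `muoneq_r_direction`) are hypothetical, the L2 norm and `eps` values are guesses, and the classic cubic Newton-Schulz iteration is swapped in for clarity (Muon implementations typically use a tuned higher-order polynomial instead).

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def row_normalize(g, eps=1e-8):
    """Scale every row to (approximately) unit L2 norm. Assumed form of the 'R' step."""
    out = []
    for row in g:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out


def newton_schulz(g, steps=12, eps=1e-8):
    """Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X, after Frobenius scaling."""
    fro = math.sqrt(sum(x * x for row in g for x in row))
    x = [[v / (fro + eps) for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


def muoneq_r_direction(g, steps=12):
    """Row-normalize first, then orthogonalize (hypothetical ordering per the flag name)."""
    return newton_schulz(row_normalize(g), steps=steps)
```

For a well-conditioned wide matrix, the output rows come out close to orthonormal. Badly conditioned inputs would need more iterations with this cubic variant.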
Other
other
QK_GAIN_INIT set to 5.0 as a tuned hyperparameter for attention scaling
parameters: {"qk_gain_init":5}
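
To illustrate why the initial QK gain matters, here is a toy sketch of attention weights computed from RMS-normalized queries and keys with a scalar gain on the query. This is an assumption about how the gain enters the attention logits, not a copy of the PR's model code. `rms_norm` and `attention_weights` are hypothetical names.

```python
import math

QK_GAIN_INIT = 5.0  # value reported in this PR


def rms_norm(v, eps=1e-6):
    """Normalize a vector by its root-mean-square value."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def attention_weights(q, keys, gain=QK_GAIN_INIT):
    """Softmax over scaled dot products of a gained, normalized query with normalized keys."""
    qn = [gain * x for x in rms_norm(q)]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(qn, rms_norm(k))) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the normalized vectors have bounded magnitude, the gain directly controls how sharp the attention distribution can get: a larger gain concentrates more weight on the best-matching key.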
other
Context-only SLOT: optimize delta using only already-scored context tokens, excluding future tokens from the loss
parameters: {"slot_steps":8,"slot_lr":0.005}
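
The context-only idea can be sketched on a toy model: a delta vector is added to the final hidden states, and it is fit using the loss on already-scored positions only, so the targets of tokens still to be scored never influence it. This is a hedged illustration, not the PR's implementation. The toy linear head, the function names, and the use of plain gradient descent (the real optimizer is not stated in this summary) are all assumptions. Only `slot_steps=8` and `slot_lr=0.005` come from the PR.

```python
import math

SLOT_STEPS = 8    # from the PR parameters
SLOT_LR = 0.005   # from the PR parameters


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def logits_for(hidden, delta, head):
    """Toy LM head: logits = head @ (hidden + delta)."""
    shifted = [h + d for h, d in zip(hidden, delta)]
    return [sum(w * s for w, s in zip(row, shifted)) for row in head]


def context_loss_and_grad(hiddens, targets, delta, head, n_context):
    """Mean cross-entropy over the first n_context positions and its gradient w.r.t. delta."""
    dim = len(delta)
    loss, grad = 0.0, [0.0] * dim
    for hidden, target in zip(hiddens[:n_context], targets[:n_context]):
        logp = log_softmax(logits_for(hidden, delta, head))
        loss -= logp[target] / n_context
        for v, row in enumerate(head):
            err = math.exp(logp[v]) - (1.0 if v == target else 0.0)
            for j in range(dim):
                grad[j] += err * row[j] / n_context
    return loss, grad


def fit_delta(hiddens, targets, head, n_context, steps=SLOT_STEPS, lr=SLOT_LR):
    """Fit delta on context positions only; positions >= n_context are never read."""
    delta = [0.0] * len(hiddens[0])
    for _ in range(steps):
        _, grad = context_loss_and_grad(hiddens, targets, delta, head, n_context)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta


def score_new_tokens(hiddens, targets, delta, head, n_context):
    """Negative log-likelihood of the not-yet-scored tokens, using the fitted delta."""
    return [
        -log_softmax(logits_for(h, delta, head))[t]
        for h, t in zip(hiddens[n_context:], targets[n_context:])
    ]
```

Since `fit_delta` slices both inputs at `n_context`, swapping the targets of the new tokens leaves the fitted delta unchanged, which is the property that keeps the evaluation free of future-token leakage.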
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
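
As a rough sketch of how a sliding-window evaluation with these parameters can be laid out: each window holds at most `seq_len` tokens, the window advances by `stride`, and only the tokens not yet scored count toward the loss. The generator name, the tuple layout, and the choice to score the whole first window are assumptions rather than details taken from the PR.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (start, end, score_from) index triples.

    Tokens in [start, end) are fed to the model as one window;
    only tokens in [score_from, end) contribute to the reported loss.
    """
    scored_until = 0
    while scored_until < n_tokens:
        # First window fills the full context; later ones add `stride` new tokens.
        step_end = seq_len if scored_until == 0 else scored_until + stride
        end = min(n_tokens, step_end)
        start = max(0, end - seq_len)
        yield start, end, scored_until
        scored_until = end


if __name__ == "__main__":
    for window in list(sliding_windows(5000))[:3]:
        print(window)
```

With a small stride, every scored token after the first window sees close to a full `seq_len` of preceding context, at the cost of many more forward passes.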
Test-Time Training
score-first TTT
parameters: {"learning_rate":null}
Sequence Length
sequence_length
train_length: null
eval_length: 2048

Novel Contributions

  • MuonEq-R optimizer variant with row-normalization before Newton-Schulz orthogonalization
  • QK_GAIN_INIT=5.0, selected via a hyperparameter sweep
  • Context-only SLOT evaluation variant that uses only past/context tokens for delta optimization
  • Combined, these changes improve on the prior PR #1179 base to reach 1.1027 val_bpb (3-seed mean)