PR #1217
open
Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)
by bigbag
val_bpb
1.1027
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.80 MB
Training Techniques
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"MuonEq-R":true,"row_normalization_before_newton_schulz":true}
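The row_normalization_before_newton_schulz flag can be illustrated with a minimal pure-Python sketch. It assumes that MuonEq-R scales each row of the momentum matrix to unit L2 norm and then applies the usual quintic Newton-Schulz iteration; the coefficients and the five-step default follow the public Muon reference code, not this PR. The helper names (row_normalize, newton_schulz, muoneq_r_direction) are hypothetical, and the parallel (sharded) part of Parallel Muon is left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def row_normalize(m, eps=1e-7):
    """Scale each row to unit L2 norm (the assumed 'Eq-R' step)."""
    return [[x / (math.sqrt(sum(v * v for v in row)) + eps) for x in row] for row in m]

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference code
    fro = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / fro for v in row] for row in g]  # spectral norm is now at most 1
    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so the Gram matrix stays small
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + w for u, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x

def muoneq_r_direction(momentum):
    """Assumed MuonEq-R update direction: row-normalize, then orthogonalize."""
    return newton_schulz(row_normalize(momentum))
```

Under this reading, rescaling any row of the momentum matrix by a positive constant leaves the update direction essentially unchanged, which is one plausible motivation for equalizing rows before orthogonalization.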
Other
other
QK_GAIN_INIT set to 5.0 as a tuned hyperparameter for attention scaling
parameters: {"qk_gain_init":5}
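The card does not say where the gain enters the attention computation. A minimal sketch, assuming queries and keys are RMS-normalized per head and the query is then multiplied by a learnable gain initialized to qk_gain_init, shows why the initial value matters: it sets how sharp the attention distribution is at the start of training. All function names here are hypothetical.

```python
import math

QK_GAIN_INIT = 5.0  # value reported in this PR

def rms_norm(v, eps=1e-6):
    """Normalize a vector to unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def attention_logit(q, k, gain=QK_GAIN_INIT):
    """Scaled dot product of a normalized, gain-scaled query with a normalized key."""
    qn = [gain * x for x in rms_norm(q)]  # assumed placement of the gain
    kn = rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, gain=QK_GAIN_INIT):
    """Attention distribution of one query over a list of keys."""
    return softmax([attention_logit(q, k, gain) for k in keys])
```

With normalized inputs the logit is bounded by gain * sqrt(head_dim), so a larger initial gain permits a more peaked distribution from the first step.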
other
Context-only SLOT: optimize delta using only already-scored context tokens, excluding future tokens from the loss
parameters: {"slot_steps":8,"slot_lr":0.005}
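The context-only restriction can be sketched as follows. The sketch assumes that SLOT adds one delta vector to the final hidden states before the output projection, and it uses plain gradient descent for the delta update; the PR does not state which optimizer it uses. The names fit_delta and mean_loss are hypothetical. The key point is that the loss driving delta is computed only over positions that have already been scored.

```python
import math

def logits_for(h, delta, w):
    """Logits of one position: (h + delta) @ w, with w stored as d rows of V columns."""
    shifted = [a + b for a, b in zip(h, delta)]
    return [sum(shifted[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss(hidden, targets, positions, delta, w):
    """Mean cross-entropy in nats over the given positions."""
    total = 0.0
    for t in positions:
        p = softmax(logits_for(hidden[t], delta, w))
        total -= math.log(p[targets[t]])
    return total / len(positions)

def fit_delta(hidden, targets, context, w, steps=8, lr=0.005):
    """Fit delta by gradient descent on the loss over context positions only."""
    d = len(w)
    delta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for t in context:
            p = softmax(logits_for(hidden[t], delta, w))
            p[targets[t]] -= 1.0  # dLoss/dLogits = softmax - one_hot
            for i in range(d):
                grad[i] += sum(w[i][j] * p[j] for j in range(len(p))) / len(context)
        delta = [x - lr * g for x, g in zip(delta, grad)]
    return delta
```

A window would then be handled by calling fit_delta with the already-scored positions as context and evaluating mean_loss with the fitted delta on the new positions. Because fit_delta never reads the targets of the new positions, their labels cannot influence the delta used to score them.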
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
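The stride and seq_len parameters can be read as the window schedule sketched below. It assumes the common convention that the first window scores all of its tokens and each later window advances by stride tokens, scoring only the newly added ones while the rest of the window serves as context. The function name sliding_windows is hypothetical.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (start, end, score_from) triples.

    Tokens in [start, end) are fed to the model; only positions in
    [score_from, end) contribute to the reported loss.
    """
    if n_tokens <= 0:
        return []
    end = min(seq_len, n_tokens)
    windows = [(0, end, 0)]  # the first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - seq_len)
        windows.append((start, new_end, end))  # score only the new tokens
        end = new_end
    return windows
```

With stride=64 and seq_len=2048, every token after the first window is scored with at least 1984 tokens of preceding context, and each token is scored exactly once.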
Test-Time Training
score-first TTT
parameters: {"learning_rate":null}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Novel Contributions
- MuonEq-R optimizer variant with row-normalization before Newton-Schulz orthogonalization
- QK_GAIN_INIT=5.0 hyperparameter sweep and selection
- Context-only SLOT evaluation variant that uses only past/context tokens for delta optimization
- Combined improvement over prior PR #1179 base to reach 1.1027 val_bpb