PR #1217

open

Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)

val_bpb

1.1027
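
For reference, a minimal sketch of how a bits-per-byte figure such as val_bpb is conventionally derived from summed token losses. It assumes the loss is a negative log-likelihood in nats and that the byte count is that of the raw validation text; the function name `bits_per_byte` is hypothetical and not taken from this PR.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats into bits per byte of raw text."""
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes
    # by text length rather than by token count.
    return total_nll_nats / math.log(2) / total_bytes
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers.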

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15.80 MB

Training Techniques

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: {"MuonEq-R":true,"row_normalization_before_newton_schulz":true}
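
The `row_normalization_before_newton_schulz` flag can be pictured with the sketch below. It assumes that row normalization means scaling each row of the update matrix to unit L2 norm, and it uses the quintic Newton-Schulz coefficients from the public Muon reference code. The function names are hypothetical, and NumPy stands in for the training framework; this is not the PR's implementation.

```python
import numpy as np

# Quintic Newton-Schulz coefficients from the public Muon reference code.
NS_COEFFS = (3.4445, -4.7750, 2.0315)

def row_normalize(g, eps=1e-7):
    """Scale each row of g to unit L2 norm (assumed meaning of the 'R' step)."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / (norms + eps)

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g by iterating a quintic polynomial."""
    a, b, c = NS_COEFFS
    x = g / (np.linalg.norm(g) + eps)  # Frobenius norm bounds singular values by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work with the smaller Gram matrix
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

def muoneq_r_direction(g, steps=5):
    """Row-normalize first, then orthogonalize (assumed MuonEq-R ordering)."""
    return newton_schulz(row_normalize(g), steps=steps)
```

The iteration does not converge to exactly orthogonal output; it pushes all singular values into a band around 1, which is the usual behavior of these coefficients.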

Other

other

QK_GAIN_INIT set to 5.0 as a tuned hyperparameter for attention scaling

parameters: {"qk_gain_init":5}
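
A rough sketch of what a QK gain can look like, assuming queries and keys are RMS-normalized and the queries are then multiplied by a scalar gain initialized to `qk_gain_init`. Whether the PR applies the gain per head, or to queries rather than to the logits, is not stated here, so the layout below is an assumption and the function names are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Normalize the last axis to unit root-mean-square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attention_weights(q, k, qk_gain=5.0):
    """Causal attention weights with QK-norm and a scalar gain on the queries."""
    d = q.shape[-1]
    qn = rms_norm(q) * qk_gain  # the gain would be a learnable parameter in training
    kn = rms_norm(k)
    logits = qn @ kn.T / np.sqrt(d)
    t = logits.shape[0]
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)  # hide future positions
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)
```

With normalized queries and keys the logits are bounded, so the gain acts as an inverse temperature: a larger value lets attention become sharper.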

other

Context-only SLOT: optimize delta using only already-scored context tokens, excluding future tokens from the loss

parameters: {"slot_steps":8,"slot_lr":0.005}
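
The context-only idea can be sketched as follows, under stated assumptions: the delta is a single vector added to the final hidden states of a window, it is fit with plain gradient descent (the PR's actual optimizer is not given here), and only the leading `n_context` positions, whose targets were already scored, contribute to the loss. The function names and the NumPy setting are hypothetical.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def context_only_slot(hidden, w_out, targets, n_context, steps=8, lr=0.005):
    """Fit a shared delta on context positions, then score the new positions.

    hidden:    (T, d) final hidden states for one window
    w_out:     (d, V) output projection
    targets:   (T,) next-token ids
    n_context: number of leading positions that were already scored
    Returns (per-position NLL in nats for the new positions, fitted delta).
    """
    delta = np.zeros(hidden.shape[1])
    ctx_h, ctx_y = hidden[:n_context], targets[:n_context]
    onehot = np.eye(w_out.shape[1])[ctx_y]
    for _ in range(steps):
        probs = np.exp(log_softmax((ctx_h + delta) @ w_out))
        # Gradient of the mean cross-entropy over context positions w.r.t. delta.
        grad = ((probs - onehot) @ w_out.T).mean(axis=0)
        delta -= lr * grad
    # Targets of the new positions are only read here, after delta is fixed.
    new_y = targets[n_context:]
    logp = log_softmax((hidden[n_context:] + delta) @ w_out)
    return -logp[np.arange(len(new_y)), new_y], delta
```

Because the fitting loop never reads `targets[n_context:]`, changing those targets cannot change the delta, which is the property that separates this variant from one that optimizes on the tokens being scored.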

Evaluation

sliding window eval

parameters: {"stride":64,"seq_len":2048}
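
A minimal sketch of how the windows might be laid out for these parameters, assuming the first window scores everything it contains and each later window advances by `stride` and scores only its newest positions. The generator name and the tuple layout are hypothetical.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (start, end, score_from) for each evaluation window.

    Positions in [score_from, end) are scored in this window;
    positions in [start, score_from) serve only as context.
    """
    scored_upto = 0
    while scored_upto < n_tokens:
        step = seq_len if scored_upto == 0 else stride
        end = min(scored_upto + step, n_tokens)
        start = max(0, end - seq_len)
        yield start, end, scored_upto
        scored_upto = end
```

Under this layout every position is scored exactly once, and after the first window each scored position sees at least `seq_len - stride` positions of context.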

Test-Time Training

score-first TTT

parameters: {"learning_rate":null}
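
The ordering implied by "score-first" can be sketched as below, assuming it means each chunk is scored with the current weights before any update that uses that chunk. The callbacks `score_fn` and `update_fn` are hypothetical, and a running-mean predictor stands in for a real model.

```python
def score_first_ttt(chunks, score_fn, update_fn):
    """Score each chunk with the current model, then adapt on that chunk."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(chunk)  # scored before the model has seen it
        update_fn(chunk)               # adaptation only benefits later chunks
    return total_loss

class RunningMean:
    """Toy stand-in for a model: predicts the mean of values seen so far."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def score(self, chunk):
        mean = self.total / self.count if self.count else 0.0
        return sum((x - mean) ** 2 for x in chunk)

    def update(self, chunk):
        self.total += sum(chunk)
        self.count += len(chunk)
```

Swapping the two calls inside the loop would let each chunk influence its own score, which is the leakage this ordering avoids.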

Sequence Length

sequence_length

train_length: null

eval_length: 2048

MuonEq-R optimizer variant with row-normalization before Newton-Schulz orthogonalization
QK_GAIN_INIT=5.0 hyperparameter sweep and selection
Context-only SLOT evaluation variant that uses only past/context tokens for delta optimization
Combined, these changes improve on the prior PR #1179 base and reach 1.1027 val_bpb (3-seed mean)