PR #1413

RECORD (open)

Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)

by dexhunter
val_bpb
1.0828
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all weights; int8 embeddings
Architecture
weight tying
Tied token embeddings
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
depth recurrence
Loops layers 4-5 twice during training
parameters: {"layers":[4,5],"loops":2}
LeakyReLU
LeakyReLU squared activation
parameters: {"slope":0.5}
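
The three non-trivial architecture entries above can be sketched in plain Python. This is only an illustration, not the submission's code. The rotary base of 10000, the half-split pairing of rotated dimensions, the 8-layer model size, and the reading of "LeakyReLU squared" as a plain square of the LeakyReLU output are all assumptions. Only `dimensions: 16`, `layers: [4, 5]`, `loops: 2`, and `slope: 0.5` come from the card.

```python
import math

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the card's slope, then squared (assumed plain square)."""
    y = x if x >= 0.0 else slope * x
    return y * y

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of one head vector.

    The remaining entries pass through unchanged, which is what makes the
    rotary embedding "partial". `base` and the half-split pairing are assumed.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

def layer_schedule(num_layers, looped=(4, 5), loops=2):
    """Order of layer indices when the looped span runs `loops` times in total."""
    first, last = looped[0], looped[-1]
    span = list(range(first, last + 1))
    return list(range(first)) + span * loops + list(range(last + 1, num_layers))

# Hypothetical 8-layer model: layers 4 and 5 appear twice in the forward pass.
schedule = layer_schedule(8)
```

Whether the recurrence repeats the span as a unit (4, 5, 4, 5) or each layer in place (4, 4, 5, 5) is not stated on the card. The sketch assumes the former.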
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: {"variant":"MuonEq-R"}
Weight Averaging
EMA
parameters: null
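
As a minimal sketch of the EMA weight averaging listed above: a shadow copy of the weights is blended toward the live weights after each optimizer step, and the shadow copy is what gets evaluated or exported. The card gives no parameters, so the decay value below is an assumption.

```python
def ema_update(shadow, params, decay=0.999):
    """Return the updated shadow weights; `decay` is an assumed value."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

# Usage: start the shadow at the initial weights and update after every step.
shadow = [0.0, 0.0]
for live in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    shadow = ema_update(shadow, live, decay=0.9)
```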
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: null
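
Sliding window evaluation can be sketched as a window schedule: each window re-feeds earlier tokens as context but only the new tokens at its end count toward the loss, so every token is scored exactly once. The window and stride sizes here are hypothetical; the card lists no parameters.

```python
def sliding_windows(num_tokens, window=8, stride=4):
    """Return (start, end, score_from) triples.

    tokens[start:end] is fed to the model, and only positions
    score_from..end-1 contribute to the reported loss.
    """
    spans = []
    scored_to = 0
    while scored_to < num_tokens:
        if scored_to == 0:
            end = min(window, num_tokens)              # first window scores everything it sees
        else:
            end = min(scored_to + stride, num_tokens)  # later windows score only the new tail
        start = max(0, end - window)
        spans.append((start, end, scored_to))
        scored_to = end
    return spans

spans = sliding_windows(16, window=8, stride=4)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.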
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"freeze_blocks":0}
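
The ordering implied by "score-first" can be illustrated with a toy model: each chunk is scored with weights that have not yet been trained on it, and only afterwards is the model adapted on that chunk. The one-parameter Bernoulli model and the chunking are stand-ins for illustration; only `learning_rate` and `epochs` come from the card, and `freeze_blocks` has no counterpart in a model this small.

```python
import math

def bce(w, bits):
    """Mean cross-entropy (nats) of a toy model with P(bit=1) = sigmoid(w)."""
    p = 1.0 / (1.0 + math.exp(-w))
    return -sum(math.log(p) if b else math.log(1.0 - p) for b in bits) / len(bits)

def grad(w, bits):
    """Derivative of bce with respect to w."""
    p = 1.0 / (1.0 + math.exp(-w))
    return sum(p - b for b in bits) / len(bits)

def score_first_ttt(w, chunks, learning_rate=0.005, epochs=3):
    losses = []
    for chunk in chunks:
        # 1) Score first: the recorded loss uses weights that never saw this chunk.
        losses.append(bce(w, chunk))
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs):
            w -= learning_rate * grad(w, chunk)
    return losses, w

losses, w_final = score_first_ttt(0.0, [[1, 1, 1, 0]] * 3)
```

Because the loss for a chunk is recorded before any gradient step on it, later chunks benefit from adaptation while no reported score depends on data the model has already trained on.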
Regularization
layerwise LN scale
parameters: null
Other
QK-Gain init
QK_GAIN_INIT increased from 4.0 to 5.0
parameters: {"qk_gain_init":5}
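
One way to picture the effect of the QK gain, assuming it is a scalar that multiplies the attention logits of normalized queries and keys (the card does not say where the gain is applied, so this placement is an assumption): a larger initial gain makes the softmax over keys sharper from the first training step.

```python
import math

def unit(v):
    """Scale a vector to unit length (stand-in for QK normalization)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def attention_weights(q, keys, qk_gain=5.0):
    """Softmax over gain-scaled cosine similarities between q and each key."""
    qn = unit(q)
    logits = [qk_gain * sum(a * b for a, b in zip(qn, unit(k))) for k in keys]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Same query and keys, old vs. new initial gain.
w_old = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], qk_gain=4.0)
w_new = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], qk_gain=5.0)
```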

Novel Contributions

  • Raised QK_GAIN_INIT from 4.0 to 5.0 on the SP8192 baseline
  • Added a legal score-first test-time training pass
  • Achieved 1.08279 val_bpb as a 3-seed mean
  • Kept the submission under the 16 MB artifact limit
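
For readers unfamiliar with the metric, a minimal sketch of how a bits-per-byte figure and a multi-seed mean are typically computed. It assumes val_bpb is the summed validation cross-entropy converted from nats to bits and divided by the number of bytes in the validation text. The function names are hypothetical and no per-seed values are reproduced here.

```python
import math

def bits_per_byte(total_nats, total_bytes):
    """Convert a summed cross-entropy in nats to bits per byte of text."""
    return total_nats / (math.log(2) * total_bytes)

def seed_mean(values):
    """Arithmetic mean over per-seed results."""
    return sum(values) / len(values)
```

Normalizing by bytes rather than tokens keeps scores comparable across tokenizers with different vocabulary sizes.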