PR #1413
RECORD
open
Record: SP8192 + QK-Gain 5 + Legal Score-First TTT - val_bpb 1.08279 (3-seed mean)
by dexhunter
val_bpb
1.0828
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all weights; int8 embeddings
Architecture
weight tying
Tied token embeddings
parameters: null
Partial RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16}
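As a rough illustration of partial rotary embeddings, the sketch below rotates only the first 16 dimensions of one head vector and passes the remaining dimensions through unchanged. The head size of 64 in the usage note, the frequency base of 10000, and the half-split pairing of dimensions are assumptions made for illustration; the summary states only `dimensions: 16`.

```python
import math


def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec` by position `pos`.

    Assumptions (not stated in the PR summary): half-split pairing, where
    entry i is paired with entry i + rot_dims // 2, and a frequency base
    of 10000.
    """
    if rot_dims % 2 != 0 or rot_dims > len(vec):
        raise ValueError("rot_dims must be even and no larger than len(vec)")
    half = rot_dims // 2
    out = list(vec)  # entries at index >= rot_dims are left untouched
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

For a hypothetical 64-dimensional head, `partial_rope(v, pos)` changes at most entries 0 to 15 and leaves entries 16 to 63 position-independent.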
depth recurrence
Loops layers 4-5 twice during training
parameters: {"layers":[4,5],"loops":2}
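One way to read `{"layers":[4,5],"loops":2}` is as an execution order in which the block of layers 4 and 5 runs twice before the forward pass continues. The sketch below builds that order. The total layer count of 8 in the usage note is hypothetical, and the choice to repeat the block as a unit (4, 5, 4, 5) rather than each layer individually (4, 4, 5, 5) is an assumption.

```python
def layer_schedule(num_layers, loop_layers, loops):
    """Return the order in which layer indices are applied.

    Assumes `loop_layers` names a contiguous block [first, ..., last]
    that is repeated `loops` times as a unit.
    """
    first, last = loop_layers[0], loop_layers[-1]
    if not 0 <= first <= last < num_layers:
        raise ValueError("loop_layers must lie inside the layer stack")
    order = list(range(first))                  # layers before the loop
    for _ in range(loops):
        order.extend(range(first, last + 1))    # looped block
    order.extend(range(last + 1, num_layers))   # layers after the loop
    return order
```

With a hypothetical 8-layer stack, `layer_schedule(8, [4, 5], 2)` yields ten layer applications from eight sets of weights, which is how recurrence adds depth without adding parameters.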
LeakyReLU
LeakyReLU squared activation
parameters: {"slope":0.5}
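A minimal scalar sketch of a LeakyReLU squared activation with slope 0.5 follows. It assumes the square is applied after the leaky ReLU, so negative inputs produce small positive outputs; the summary does not spell out the exact formulation.

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result.

    Assumption: the square follows the LeakyReLU, so the output is
    always non-negative.
    """
    y = x if x >= 0.0 else slope * x
    return y * y
```

Under this reading, an input of -2.0 maps to (0.5 * -2.0)^2 = 1.0, while an input of 3.0 maps to 9.0.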
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: {"variant":"MuonEq-R"}
Weight Averaging
EMA
parameters: null
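Since the summary lists no EMA parameters, the sketch below shows only the generic exponential moving average update over a dictionary of scalar weights. The decay value of 0.999 is a placeholder, not a setting taken from the PR.

```python
def ema_update(ema, weights, decay=0.999):
    """Return the averaged weights after one EMA step.

    `decay` is a hypothetical value; the PR summary does not report one.
    """
    return {
        name: decay * ema[name] + (1.0 - decay) * weights[name]
        for name in ema
    }
```

In a training loop this would be called once per optimizer step, with the averaged copy (rather than the raw weights) used for the final evaluation or export.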
Compression
lzma
level: null
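The compression step and the size limit can be sketched together with Python's standard `lzma` module. The preset of 9 is an assumption (the summary gives `level: null`), as is the reading of "16 MB" as 16,000,000 bytes; `pack_artifact` and `fits_limit` are hypothetical helper names.

```python
import lzma

# Assumption: the limit is decimal megabytes (16,000,000 bytes).
SIZE_LIMIT_BYTES = 16_000_000


def pack_artifact(payload: bytes, preset: int = 9) -> bytes:
    """Compress a serialized artifact with LZMA (preset is an assumption)."""
    return lzma.compress(payload, preset=preset)


def fits_limit(blob: bytes, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Check that the compressed artifact is strictly under the limit."""
    return len(blob) < limit
```

A round trip through `lzma.decompress(pack_artifact(payload))` should return the original bytes, so the check can run on exactly the blob that would be submitted.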
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"freeze_blocks":0}
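The defining property of score-first test-time training is ordering: each chunk of evaluation data is scored with the current weights before the model takes any gradient step on it. The toy sketch below shows that ordering with a one-parameter model (a single mean estimate under squared error) standing in for the transformer. The learning rate of 0.005 and 3 epochs come from the summary; the chunking, the plain gradient-descent update, and the toy model are assumptions.

```python
def mean_squared_error(mu, chunk):
    """Loss of the toy model, which predicts `mu` for every value."""
    return sum((x - mu) ** 2 for x in chunk) / len(chunk)


def score_first_ttt(chunks, mu=0.0, learning_rate=0.005, epochs=3):
    """Score each chunk first, then adapt on it.

    Returns the per-chunk scores and the final parameter. No chunk is
    ever scored by weights that have already been trained on it.
    """
    scores = []
    for chunk in chunks:
        # 1. Score with the weights as they stand before seeing this chunk.
        scores.append(mean_squared_error(mu, chunk))
        # 2. Only then adapt on the chunk that was just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
            mu -= learning_rate * grad
    return scores, mu
```

Because the first chunk is always scored by the unadapted model, any improvement shows up only on later chunks.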
Regularization
layerwise LN scale
parameters: null
Other
other
QK_GAIN_INIT increased from 4.0 to 5.0
parameters: {"qk_gain_init":5}
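The summary does not define how the QK gain enters the attention computation. One plausible reading, sketched below, is a scalar gain applied to the query after RMS normalization of queries and keys, so that the gain sets the scale of the attention logits. The normalization, the 1/sqrt(d) factor, and the function names are assumptions for illustration, not the PR's confirmed implementation.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Scale `vec` to have a root-mean-square of about 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, gain):
    """Attention logit for one query/key pair under the assumed scheme.

    Assumption: `gain` multiplies the normalized query; in a real model
    it would be a learnable parameter initialized to QK_GAIN_INIT.
    """
    qn = [gain * x for x in rms_normalize(q)]
    kn = rms_normalize(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

Under this scheme, raising the initial gain from 4.0 to 5.0 multiplies every initial logit by 1.25, which sharpens the softmax at the start of training.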
Novel Contributions
- Raised QK_GAIN_INIT from 4.0 to 5.0 on the SP8192 baseline
- Added a legal score-first test-time training pass
- Achieved 1.08279 val_bpb as a 3-seed mean
- Kept the submission under the 16 MB artifact limit