PR #1465
Non-record: v6.2 Phase 5a SOTA-trivial stack (3-seed re-run @66% = 1.138112; TTT 1.204 not competitive)
Status: open
by sisegod
val_bpb: 1.1381
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15 MB
Training Techniques

Quantization
- mixed int6 (bits: 6, scope: embeddings)
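A minimal sketch of what int6 quantization of the embedding path could look like. The symmetric per-row scaling and the signed range [-31, 31] are assumptions; the PR does not specify the exact scheme.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-row quantization into the signed 6-bit range [-31, 31]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6-bit values stored in int8
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 16)).astype(np.float32)  # toy embedding table
q, s = quantize_int6(emb)
recon = dequantize_int6(q, s)
err = np.abs(recon - emb).max()  # bounded by half a quantization step per row
```

The round-to-nearest error is at most half a scale step per element, which is why the reconstruction stays close to the original table.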
Architecture
- weight tying: tied token embeddings with int6 quantization for the embedding path (parameters: null)
- depth recurrence: tested depth-recurrence variants as an alternative architecture, though they were abandoned (parameters: {"unique_layers": 9, "recur": 2, "effective_layers": 18})
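The abandoned depth-recurrence variant reuses one stack of unique blocks several times, so 9 unique layers applied twice give 18 effective layers. A sketch of the layer-reuse loop; the toy affine blocks are illustrative, not the real architecture:

```python
import numpy as np

def depth_recurrent_forward(x, blocks, recur=2):
    """Apply the same stack of unique blocks `recur` times.
    Effective depth = len(blocks) * recur, with no extra parameters."""
    calls = 0
    for _ in range(recur):
        for block in blocks:
            x = block(x)
            calls += 1
    return x, calls

# 9 unique "layers" (toy affine maps) reused twice -> 18 effective layers
rng = np.random.default_rng(0)
blocks = [(lambda x, w=rng.standard_normal(): x + 0.01 * w) for _ in range(9)]
y, calls = depth_recurrent_forward(np.zeros(4), blocks, recur=2)
```

The appeal is parameter reuse: depth doubles while the weight count (and hence artifact size) stays that of the 9 unique layers.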
Optimizer
- Muon (weight_decay: null, momentum: null, other_params: {"ttt_muon": true, "newton_schulz": 5})
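Muon orthogonalizes each 2-D update matrix with a few Newton-Schulz iterations. A sketch using the odd-polynomial coefficients popularized by the Muon optimizer and the 5 steps named in `newton_schulz: 5`; the exact variant used in this PR is an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately map a matrix to its orthogonal polar factor via an
    odd-polynomial Newton-Schulz iteration (Muon-style coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize so singular values <= 1
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x  # acts on each singular value independently
    return x

# Toy gradient with a known, moderately spread singular spectrum
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((8, 8)))
v, _ = np.linalg.qr(rng.standard_normal((8, 8)))
g = u @ np.diag(np.linspace(0.2, 1.0, 8)) @ v.T
y = newton_schulz_orthogonalize(g, steps=5)
sv = np.linalg.svd(y, compute_uv=False)  # pushed toward 1 after 5 steps
```

Because the iteration is an odd polynomial in the matrix, it rescales each singular value toward 1 without changing the singular vectors, which is the effect Muon wants from its update.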
Weight Averaging
- EMA (parameters: {"decay": 0.9965})
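The EMA entry is the standard exponential moving average of weights; a minimal sketch with the decay from this PR (the scalar form stands in for a per-tensor update over the full parameter set):

```python
def ema_update(avg, new, decay=0.9965):
    """Exponential moving average of weights: avg <- decay*avg + (1-decay)*new."""
    return decay * avg + (1.0 - decay) * new

# With decay 0.9965 the effective averaging window is roughly
# 1 / (1 - 0.9965) ~ 286 steps; a constant stream converges to that constant.
avg = 0.0
for _ in range(5000):
    avg = ema_update(avg, 1.0, decay=0.9965)
```

Evaluation then uses the averaged copy rather than the raw training weights.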
Evaluation
- sliding window eval (parameters: {"stride": 64})
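Sliding-window evaluation scores a long sequence in overlapping windows so each token is scored once with fresh left context. A sketch of the span planner only (the window size and the score-last-`stride`-tokens policy are assumptions; the PR specifies just `stride: 64`):

```python
def sliding_window_spans(seq_len, window=128, stride=64):
    """Plan eval spans: each tuple is (ctx_start, ctx_end, score_start).
    Tokens in [score_start, ctx_end) are scored using context from ctx_start,
    so every token is scored exactly once."""
    spans = []
    score_start = 0
    while score_start < seq_len:
        if score_start == 0:
            ctx_start, ctx_end = 0, min(window, seq_len)
        else:
            ctx_end = min(score_start + stride, seq_len)
            ctx_start = max(0, ctx_end - window)  # keep window-stride tokens of context
        spans.append((ctx_start, ctx_end, score_start))
        score_start = ctx_end
    return spans

spans = sliding_window_spans(seq_len=200, window=128, stride=64)
scored = sum(e - s0 for (_, e, s0) in spans)  # total tokens scored
```

A smaller stride gives later tokens more context per window at the cost of more forward passes.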
Test-Time Training
- score-first TTT (parameters: {"muon": true, "stride": 64})
Regularization
- weight decay (parameters: null)
Novel Contributions
- Phase 5a trivial-wins composition combining prior improvements from QK gain initialization, Muon row normalization, EMA tuning, hidden multiplier re-investment, and int6 tied embeddings.
- A 3-seed SLOT-100 re-run showing an improved mid-training eval (@66%) and a re-run validation bpb of about 1.138112.
- Legal score-first Muon TTT was evaluated and found not competitive versus aggressive SLOT.
- Use of custom rANS entropy coding to pack the model into a sub-16MB artifact.
- Hidden multiplier increased from 4x to 5x as a byte re-investment that improved performance.
- Extensive negative ablations documenting unsuccessful compression and architecture ideas.
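The sub-16MB artifact relies on rANS entropy coding of the weights. As a toy illustration of the rANS mechanism only (a Python bigint stands in for the renormalized 32-bit state and byte stream of a production coder, and the frequency table here is made up):

```python
def _cumfreq(freqs):
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    return cum

def rans_encode(symbols, freqs):
    """Encode a symbol list into one integer state (toy, unbounded-state rANS)."""
    cum, total = _cumfreq(freqs), sum(freqs)
    x = 1
    for s in reversed(symbols):  # rANS decodes in reverse encode order
        f = freqs[s]
        x = (x // f) * total + (x % f) + cum[s]
    return x

def rans_decode(x, freqs, n):
    """Recover n symbols; frequent symbols cost fewer bits of state growth."""
    cum, total = _cumfreq(freqs), sum(freqs)
    out = []
    for _ in range(n):
        slot = x % total
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        out.append(s)
        x = freqs[s] * (x // total) + slot - cum[s]
    return out

freqs = [5, 2, 1]               # skewed toy alphabet {0, 1, 2}
msg = [0, 0, 1, 0, 2, 0, 1, 0]
code = rans_encode(msg, freqs)  # compact integer; decode reverses it exactly
```

Skewed symbol distributions (as in quantized weight tensors) are exactly where an entropy coder beats fixed-width packing.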