PR #274 (open)
[Record] Stride-32 + Warmdown/Muon Tuning on SOTA #1: mean val_bpb=1.1403
by haikosys
val_bpb: 1.1403
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: MLP, attention, tied embeddings
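The bits: 6 entry suggests symmetric 6-bit quantization for most tensors, with int8 presumably reserved for the more sensitive ones. A minimal sketch of per-tensor symmetric quantization, under that assumption (the actual scale granularity and the int6-vs-int8 selection rule are not specified in the record; all names here are illustrative):

```python
def quantize_symmetric(weights, bits=6):
    """Symmetric per-tensor quantization: map floats onto signed
    integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1] with one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -0.31, 0.02, -0.77]
q, scale = quantize_symmetric(weights, bits=6)
recovered = dequantize(q, scale)
```

The reconstruction error is bounded by half a quantization step (scale / 2) per weight.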
Architecture
SmearGate
Uses SmearGate in the base architecture.
parameters: null
BigramHash
Adds BigramHash embedding component.
parameters: {"size":10240,"dim":128}
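The parameters suggest a 10240-row, 128-dim table indexed by a hash of each (previous token, current token) pair, used alongside the regular token embedding. A sketch under that assumption (the actual hash function and how the looked-up vector is combined with the token embedding are not specified in the record):

```python
import random

SIZE, DIM = 10240, 128        # from parameters {"size":10240,"dim":128}

rng = random.Random(0)
bigram_table = [[rng.gauss(0, 0.02) for _ in range(DIM)]
                for _ in range(SIZE)]

def bigram_hash(prev_tok, tok, size=SIZE):
    # Simple multiplicative hash of the ordered pair; illustrative only,
    # the real hash function is not given in the record.
    return (prev_tok * 1000003 + tok) % size

def bigram_embedding(tokens):
    """One 128-dim vector per position, keyed by the hashed bigram
    (tokens[i-1], tokens[i]); position 0 pairs with a dummy token 0."""
    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else 0
        out.append(bigram_table[bigram_hash(prev_tok, tok)])
    return out

embs = bigram_embedding([5, 17, 17, 3])
```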
MLP3x
Uses a 3x expanded MLP hidden size.
parameters: {"hidden_size":1536}
tied embeddings
Uses FP16 tied embeddings.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
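The record tunes Muon's momentum down to 0.95. Muon accumulates gradients in a momentum buffer and then orthogonalizes the buffered update (via a Newton-Schulz iteration) before applying it; the sketch below shows only the momentum accumulation, with the orthogonalization step omitted, and the Nesterov-style lookahead is an assumption:

```python
def muon_momentum_step(grad, buf, momentum=0.95, nesterov=True):
    """One momentum accumulation step on flat lists of floats. In Muon
    the resulting update would then be orthogonalized before use."""
    new_buf = [momentum * b + g for b, g in zip(buf, grad)]
    if nesterov:
        # Lookahead: blend the fresh gradient back into the buffer.
        update = [g + momentum * nb for g, nb in zip(grad, new_buf)]
    else:
        update = new_buf
    return update, new_buf

upd, buf = muon_momentum_step([1.0], [0.0], momentum=0.95)
```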
Weight Averaging
SWA
parameters: {"every":50,"start":"40%"}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":32}
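A stride of 32 means each evaluation window scores only its final 32 tokens, with the rest of the window serving as overlapping context; every token is still scored exactly once. A sketch of the span bookkeeping (the window length of 2048 is an assumption borrowed from the training sequence length, since eval_length is null):

```python
def eval_spans(n_tokens, window=2048, stride=32):
    """Spans (ctx_start, end, score_from): each window scores only its
    last `stride` tokens; tokens before `score_from` are context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, pos))
        pos += stride
    return spans

spans = eval_spans(100)
```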
Test-Time Training
LoRA TTT
parameters: {"rank":8}
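Rank-8 LoRA adapters are trained at evaluation time. The usual LoRA construction initializes B to zero so the adapter contributes nothing before any test-time steps have run; a sketch of that setup (which weight matrices are adapted, and the alpha scaling, are not specified in the record):

```python
import random

def lora_init(d_in, d_out, rank=8, seed=0):
    """LoRA adapter pair: A is small random, B starts at zero, so
    W + B @ A == W until test-time training updates the adapter."""
    rng = random.Random(seed)
    A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_delta(A, B):
    """The low-rank update B @ A, shape (d_out, d_in)."""
    rank, d_in = len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(d_in)] for i in range(len(B))]

A, B = lora_init(16, 16, rank=8)
delta = lora_delta(A, B)
```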
Initialization
OrthoInit
Orthogonal initialization.
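A sketch of orthogonal initialization via Gram-Schmidt on Gaussian vectors. Real implementations typically use a QR decomposition (e.g. torch.nn.init.orthogonal_); this pure-Python version is for illustration only:

```python
import random

def orthogonal_init(n, seed=0):
    """Square orthogonal matrix: draw Gaussian rows, then project each
    against the previous rows (Gram-Schmidt) and normalize."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0, 1) for _ in range(n)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:                 # skip (near-)dependent draws
            rows.append([a / norm for a in v])
    return rows

Q = orthogonal_init(4)
```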
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":5000}
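With warmdown_iters=5000, the learning rate presumably holds at its base value and then decays linearly to zero over the final 5000 iterations, as in the usual warmup-stable-decay ("trapezoidal") schedule. A sketch under that assumption (total_iters and base_lr are illustrative):

```python
def lr_at(step, total_iters, base_lr, warmdown_iters=5000):
    """Hold base_lr, then decay linearly to zero over the final
    warmdown_iters steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```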
Regularization
weight decay
parameters: {"weight_decay":0.04}
Novel Contributions
- Stride-32 sliding window evaluation with 2x context overlap
- Warmdown tuning extended to 5000 iterations
- Muon momentum tuning from 0.99 to 0.95
- Reduced training batch tokens to 524288
- LoRA test-time training with rank-8 adapters during evaluation
- Per-document adapter reset and score-then-train ordering to preserve causality
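The last contribution pins down the ordering of the test-time-training loop: adapters are reset at every document boundary, and each chunk is scored before the model trains on it, so no token's loss is ever computed by a model that has already seen that token. A sketch of the loop (the callbacks here are logging stubs; the real ones would run the model forward pass and a rank-8 LoRA update step):

```python
def evaluate_with_ttt(documents, score_fn, train_fn, reset_fn):
    """Per-document TTT: reset adapter state at each document boundary,
    and score every chunk BEFORE adapting on it (causality preserved)."""
    losses = []
    for doc in documents:
        reset_fn()                          # fresh LoRA state per document
        for chunk in doc:
            losses.append(score_fn(chunk))  # score first...
            train_fn(chunk)                 # ...then train on the same chunk
    return losses

# Logging stubs to make the ordering visible:
log = []
evaluate_with_ttt(
    [["a1", "a2"], ["b1"]],
    score_fn=lambda c: log.append(("score", c)) or 0.0,
    train_fn=lambda c: log.append(("train", c)),
    reset_fn=lambda: log.append(("reset", None)),
)
```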