PR #1903
open
Record: 0.9418 BPB — BigramHash + MuonEqR + 3L-Recurrence + SDClip (3-seed mean)
by GrishaKhumaryan
val_bpb
0.9418
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.83 MiB
Training Techniques
Architecture
BigramHash
Injects bigram hash embeddings into the residual stream to capture local patterns.
parameters: {"vocab_size":4096,"dim":32}
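A minimal sketch of how a hashed bigram lookup with these parameters could work. The hash function, the padding token, and the random initialisation are assumptions made for illustration; the listing does not say how the PR hashes token pairs or how the 32-dimensional vector is projected into the residual stream.

```python
import random

BIGRAM_VOCAB = 4096  # "vocab_size" from the parameters above
BIGRAM_DIM = 32      # "dim" from the parameters above


def bigram_bucket(prev_token: int, token: int) -> int:
    """Map a (previous, current) token pair to a bucket id.

    The multiplier is an arbitrary odd constant chosen for this sketch,
    not a value taken from the PR.
    """
    return (prev_token * 1_000_003 + token) % BIGRAM_VOCAB


def bigram_features(tokens, table, pad_token=0):
    """Return one BIGRAM_DIM vector per position, looked up by hashed bigram."""
    out = []
    prev = pad_token  # hypothetical padding id used for the first position
    for tok in tokens:
        out.append(table[bigram_bucket(prev, tok)])
        prev = tok
    return out


# Hypothetical embedding table; in a real model this would be a learned parameter.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(BIGRAM_DIM)]
         for _ in range(BIGRAM_VOCAB)]

feats = bigram_features([5, 17, 17, 42], table)
```

Because the lookup depends only on the token pair, the same bigram yields the same vector wherever it occurs. Distinct bigrams that collide in the 4096 buckets share a vector, which is the usual trade-off of a hashed table.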
depth recurrence
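Progressive recurrence over the last three layers: after one pass through all six layers, layers 3, 4, and 5 are applied a second time, following the repeated layer pattern below.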
parameters: {"layers":6,"pattern":[0,1,2,3,4,5,3,4,5]}
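The pattern above can be read as an execution order over six shared layers. The sketch below uses toy layers (each adds a constant) only to show the control flow; the layer bodies and the `trace` bookkeeping are hypothetical, and it does not model how the recurrence is phased in during training.

```python
PATTERN = [0, 1, 2, 3, 4, 5, 3, 4, 5]  # "pattern" from the parameters above
NUM_LAYERS = 6                         # "layers" from the parameters above


def make_layer(i):
    """Build a toy stand-in for transformer block i."""
    def layer(x, trace):
        trace.append(i)     # record which block ran
        return x + (i + 1)  # placeholder computation
    return layer


def forward(x, layers, pattern=PATTERN):
    """Apply the shared layers in pattern order; repeated indices reuse weights."""
    trace = []
    for idx in pattern:
        x = layers[idx](x, trace)
    return x, trace


layers = [make_layer(i) for i in range(NUM_LAYERS)]
out, trace = forward(0, layers)
```

Nine block applications run per forward pass while only six sets of weights are stored, so the extra depth adds no parameters to the artifact.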
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
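For orientation, the sketch below shows the Newton-Schulz orthogonalization step used by the baseline Muon optimizer, written in plain Python on nested lists. It is not the MuonEq-R variant, whose changes are not described in this listing; the helper names are hypothetical and the coefficients are those of the commonly used public Muon implementation.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz(g, steps=5):
    """Push the singular values of g towards 1 with a quintic iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-7) for v in row] for row in g]  # Frobenius-normalise
    for _ in range(steps):
        A = matmul(x, transpose(x))
        AA = matmul(A, A)
        n = len(A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, x)
        x = [[a * x[i][j] + BX[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


# A diagonal "gradient" with very different singular values (2.0 and 0.5).
ortho = newton_schulz([[2.0, 0.0], [0.0, 0.5]])
```

After five steps both diagonal entries land in a band around 1 rather than converging to exactly 1. The iteration trades precision for speed in this way.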
Quantization
int6
bits: 6
scope: model weights
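As a rough illustration of what 6-bit weight quantization involves, the sketch below maps a list of weights to signed 6-bit codes with a single max-abs scale. This is a generic symmetric scheme, not SDClip: the listing does not describe SDClip's clipping rule, and the function names and per-tensor scale are assumptions.

```python
def quantize_int6(weights):
    """Symmetric quantization to integer codes in [-31, 31] with one scale."""
    qmax = 31  # largest magnitude kept from the signed 6-bit range [-32, 31]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int6(codes, scale):
    """Recover approximate weights from codes and the stored scale."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int6(weights)
restored = dequantize_int6(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`). A clipping scheme shrinks `scale` to lower that error for typical weights, at the cost of saturating outliers.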
Test-Time Training
score-first TTT
parameters: {"passes":2}
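The sketch below illustrates the score-first idea on a toy one-parameter model: each chunk is scored with the current weights before any gradient step is taken on it, so no token contributes to the metric after the model has trained on it. Treating `passes` as the number of adaptation steps per chunk is an assumption, as are the squared-error loss and learning rate; the listing does not define these details.

```python
def score_first_ttt(chunks, mu=0.0, lr=0.25, passes=2):
    """Score each chunk first, then adapt the toy model `mu` on it."""
    total_loss, count = 0.0, 0
    for chunk in chunks:
        # 1) Score: the loss is recorded before the model has seen this chunk.
        total_loss += sum((x - mu) ** 2 for x in chunk)
        count += len(chunk)
        # 2) Adapt: only now take gradient steps on the chunk just scored.
        for _ in range(passes):
            grad = sum(2.0 * (mu - x) for x in chunk) / len(chunk)
            mu -= lr * grad
    return total_loss / count, mu


chunks = [[1.0, 1.0], [1.0, 1.0]]
adapted_loss, adapted_mu = score_first_ttt(chunks, passes=2)
static_loss, _ = score_first_ttt(chunks, passes=0)
```

The first chunk is scored identically with or without adaptation; only later chunks benefit from what was learned on earlier ones.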
Weight Averaging
EMA
parameters: null
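A minimal sketch of an exponential moving average over model weights. The decay value is an assumption, since the listing gives no EMA parameters, and the flat list of floats stands in for real parameter tensors.

```python
def ema_update(shadow, params, decay):
    """Blend the current parameters into the running average."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


decay = 0.9      # hypothetical value; not stated in the listing
shadow = [0.0]   # averaged copy of the weights
params = [1.0]   # live training weights (held constant in this toy example)

for _ in range(2):
    shadow = ema_update(shadow, params, decay)
```

The averaged copy, rather than the live weights, is typically what gets evaluated and exported at the end of training.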
Compression
lzma
level: null
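The sketch below shows a round trip of quantized codes through Python's standard `lzma` module. The one-byte-per-code packing and `preset=9` are assumptions; the listing gives no compression level and does not describe how the PR serialises its weights.

```python
import lzma


def compress_codes(codes, preset=9):
    """Shift signed 6-bit codes into [0, 63], store one per byte, then compress."""
    raw = bytes(c + 32 for c in codes)
    return lzma.compress(raw, preset=preset)


def decompress_codes(blob):
    """Invert compress_codes and return the signed codes."""
    return [b - 32 for b in lzma.decompress(blob)]


codes = [0, -31, 31, 5] * 1000  # toy, highly repetitive payload
blob = compress_codes(codes)
roundtrip = decompress_codes(blob)
```

Real weight codes are far less repetitive than this toy payload, so the compression ratio seen here says nothing about the ratio achieved on the actual artifact.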
Novel Contributions
- BigramHash skip connections
- 3-layer progressive recurrence with repeated layer pattern
- MuonEq-R optimizer
- SDClip 6-bit sigma-delta quantization
- Legal score-first two-pass test-time training
- EMA with parallel residual blocks