PR #1903

open

Record: 0.9418 BPB — BigramHash + MuonEqR + 3L-Recurrence + SDClip (3-seed mean)

by GrishaKhumaryan
val_bpb: 0.9418
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.83 MiB
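
For readers unfamiliar with the metric, bits per byte (BPB) is the summed negative log-likelihood on the validation text, converted from nats to bits and divided by the number of raw bytes. The sketch below shows that conversion and a plain mean over seeds. The function names and the numbers are illustrative placeholders, not values or code from this PR.

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * num_bytes)

def seed_mean(values):
    """Plain arithmetic mean, as in a '3-seed mean'."""
    return sum(values) / len(values)

# Placeholder input: exactly one bit of loss per byte over 1000 bytes.
example_bpb = bits_per_byte(total_nll_nats=math.log(2) * 1000, num_bytes=1000)

# Placeholder per-seed scores, NOT the PR's actual per-seed results.
example_mean = seed_mean([1.0, 2.0, 3.0])
```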

Training Techniques

Architecture
BigramHash
Injects bigram hash embeddings into the residual stream to capture local patterns.
parameters: {"vocab_size":4096,"dim":32}
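
A minimal sketch of the bigram-hash lookup, assuming each (previous token, current token) pair is hashed into one of 4096 buckets and each bucket owns a 32-dimensional vector. The hash function, the padding id for the first position, and the random table are assumptions for illustration. The projection from 32 dimensions up to the model width, and the point where the result is added to the residual stream, are omitted because the summary does not describe them.

```python
import random

NUM_BUCKETS = 4096  # "vocab_size" in the listed parameters
DIM = 32            # "dim" in the listed parameters

def bigram_bucket(prev_id: int, cur_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a token pair to a bucket. The multiplier is a hypothetical choice."""
    return ((prev_id * 1000003) ^ cur_id) % num_buckets

def bigram_features(token_ids, table, pad_id: int = 0):
    """Return one table row per position, keyed by (previous, current) token."""
    feats = []
    prev = pad_id  # assumed stand-in for "no previous token"
    for tok in token_ids:
        feats.append(table[bigram_bucket(prev, tok)])
        prev = tok
    return feats

# Random stand-in for a learned embedding table.
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

feats = bigram_features([5, 17, 5, 17], table)
```

Positions 1 and 3 both see the bigram (5, 17), so they receive the same vector. Distinct bigrams can also collide in one bucket, which is the usual trade-off of a hashed table.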
depth recurrence
Progressive recurrence over the last layers using a repeated layer pattern.
parameters: {"layers":6,"pattern":[0,1,2,3,4,5,3,4,5]}
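
The pattern can be read as an execution order over six distinct layers: layers 0 to 5 run once, then layers 3, 4 and 5 run a second time with the same weights, giving nine layer applications from six layers' worth of parameters. The sketch below uses toy layers that only record their index. How the "progressive" part is scheduled during training is not stated in the summary and is not modelled here.

```python
PATTERN = [0, 1, 2, 3, 4, 5, 3, 4, 5]  # from the listed parameters
NUM_LAYERS = 6                          # "layers" in the listed parameters

def make_toy_layer(idx, trace):
    """Stand-in for a transformer block: logs its index and adds 1."""
    def layer(x):
        trace.append(idx)
        return x + 1
    return layer

def forward(x, layers, pattern=PATTERN):
    """Apply layers in pattern order; repeated indices reuse the same weights."""
    for idx in pattern:
        x = layers[idx](x)
    return x

trace = []
layers = [make_toy_layer(i, trace) for i in range(NUM_LAYERS)]
out = forward(0, layers)
effective_depth = len(PATTERN)
```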
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
Quantization
int6
bits: 6
scope: model weights
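
As a rough illustration of 6-bit weight quantization, the sketch below uses plain symmetric round-to-nearest with an optional clipping threshold. This is a generic scheme, not the PR's SDClip method, whose details are not given in this summary. The function names and the `clip` parameter are hypothetical.

```python
def quantize_int6(weights, clip=None):
    """Symmetric 6-bit quantization to integer codes in [-31, 31]."""
    qmax = 31
    max_abs = max(abs(w) for w in weights)
    # Optional clipping shrinks the range so that small weights get finer steps.
    limit = max_abs if clip is None else min(clip, max_abs)
    scale = limit / qmax if limit > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int6(weights)
restored = dequantize(codes, scale)
```

Without clipping, every restored weight is within half a quantization step of the original. With clipping, outliers saturate at the limit in exchange for a smaller step size elsewhere.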
Test-Time Training
score-first TTT
parameters: {"passes":2}
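
The ordering constraint behind "score-first" can be sketched with a toy one-parameter model: each chunk is scored with the current weights before any update that uses that chunk, so no token is evaluated by weights that have already trained on it. The interpretation of `passes` as two gradient steps per scored chunk is an assumption, as are the toy model, the squared-error loss, and the learning rate.

```python
def score_first_ttt(chunks, mu=0.0, lr=0.25, passes=2):
    """Toy test-time training: score each chunk, then adapt on it."""
    scores = []
    for chunk in chunks:
        # 1) Score with the weights as they are, before adapting on this chunk.
        scores.append(sum((v - mu) ** 2 for v in chunk) / len(chunk))
        # 2) Only then take gradient steps on the already-scored chunk.
        for _ in range(passes):
            grad = sum(2 * (mu - v) for v in chunk) / len(chunk)
            mu -= lr * grad
    return scores, mu

scores, final_mu = score_first_ttt([[1.0, 1.0], [1.0, 1.0]])
```

In this toy run the first chunk is scored by the untouched initial parameter, and the second chunk benefits only from adaptation on the first.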
Weight Averaging
EMA
parameters: null
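
An exponential moving average keeps a smoothed copy of the weights that is updated after each optimizer step and used in place of the raw weights at evaluation time. The summary lists no EMA parameters, so the decay values below are assumed placeholders, and starting the average from zeros is only for easy arithmetic.

```python
def ema_update(ema, weights, decay=0.999):
    """Blend the running average toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

# Toy run with a deliberately small decay so the effect is visible.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
```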
Compression
lzma
level: null
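
A sketch of how quantized weights might be turned into a compressed artifact with Python's standard `lzma` module. Since no level is listed, the default preset is used. Storing one code per byte is an assumption made for simplicity. Whether the PR bit-packs its 6-bit codes before compression is not stated.

```python
import lzma

def pack_and_compress(codes):
    """Shift signed codes in [-31, 31] to unsigned bytes, then LZMA-compress."""
    raw = bytes(c + 32 for c in codes)  # one byte per code, no bit packing
    return lzma.compress(raw)

def decompress_and_unpack(blob):
    """Invert pack_and_compress."""
    return [b - 32 for b in lzma.decompress(blob)]

codes = [0, 1, -1, 31, -31] * 1000  # toy, highly repetitive payload
blob = pack_and_compress(codes)
size_mib = len(blob) / 2**20        # artifact size in MiB
```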

Novel Contributions

  • BigramHash skip connections
  • 3-layer progressive recurrence with repeated layer pattern
  • MuonEq-R optimizer
  • SDClip 6-bit sigma-delta quantization
  • Legal score-first two-pass test-time training
  • EMA with parallel residual blocks