PR #1789

closed

Record: Nairi-Micro - 0.9982 BPB

val_bpb: 0.9982
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.8 MB
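
For readers unfamiliar with the headline metric, the sketch below shows one common way a bits-per-byte (BPB) figure such as val_bpb can be derived from a cross-entropy loss measured in nats. The function names and the idea of tracking a tokens-per-byte ratio are illustrative assumptions; the PR does not show how its evaluation script computes the number.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (nats) into bits per byte.

    total_loss_nats: sum of per-token negative log-likelihoods over the
        validation set, in nats (hypothetical input, not from the PR).
    total_bytes: number of raw bytes those tokens decode to.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes by raw data size rather than by tokenizer-dependent tokens.
    return total_loss_nats / (math.log(2) * total_bytes)


def bpb_from_mean_token_loss(mean_loss_nats: float, tokens_per_byte: float) -> float:
    """Same conversion, starting from a mean per-token loss."""
    return mean_loss_nats / math.log(2) * tokens_per_byte
```

Normalizing by bytes rather than tokens keeps the metric comparable across submissions that use different tokenizers.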

Training Techniques

Architecture
Transformer
10-layer Transformer with a 576-dimensional model width, sized to use most of the 16 MB artifact limit (13.8 MB artifact).
parameters: {"layers":10,"dimensions":576}
Quantization
mixed int5/int6
bits: null
scope: weights
QAT
bits: null
scope: weights
Test-Time Training
full TTT
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"schedule":"WSD"}
LR Schedule
warmup-stable-decay
parameters: null
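
The warmup-stable-decay (WSD) schedule named above can be sketched as a piecewise function of the training step. Since the PR lists the schedule parameters as null, every argument here (peak_lr, warmup_steps, decay_steps, min_lr) and the choice of linear ramps are assumptions made for illustration only.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under a warmup-stable-decay schedule.

    Three phases: linear warmup to peak_lr, a constant plateau, then a
    linear decay to min_lr over the final `decay_steps` steps.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if decay_steps <= 0 or step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```

For example, with total_steps=100, warmup_steps=10, and decay_steps=20, the rate holds at peak_lr from step 9 through step 80 and reaches min_lr at step 100.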

Novel Contributions

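  • 10-layer, 576-dimensional Transformer sized close to the 16 MB artifact limit (13.8 MB artifact)
  • Mixed-precision int5/int6 weight quantization with quantization-aware training (QAT)
  • Legal test-time training (full TTT) adaptation
  • Muon optimizer with a warmup-stable-decay (WSD) learning-rate schedule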
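
As a rough illustration of the int5/int6 weight quantization listed above, the sketch below applies symmetric per-tensor quantize-dequantize ("fake quantization") and estimates packed storage size. The per-tensor scaling, the rounding rule, and both function names are assumptions; the PR does not say which tensors use 5 bits versus 6, nor how scales are chosen. In QAT this rounding would typically run in the forward pass with gradients passed through unchanged (a straight-through estimator), which this forward-only sketch does not implement.

```python
def fake_quantize(weights, bits):
    """Round weights onto a signed `bits`-bit grid and map them back to floats.

    Returns (dequantized_weights, scale). Symmetric per-tensor scaling is an
    assumption made for this sketch.
    """
    qmax = 2 ** (bits - 1) - 1  # 15 for int5, 31 for int6
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights), 1.0  # nothing to scale
    scale = max_abs / qmax
    dequantized = []
    for w in weights:
        # Round to the nearest integer level, clamped to the symmetric range.
        q = max(-qmax, min(qmax, round(w / scale)))
        dequantized.append(q * scale)
    return dequantized, scale


def packed_bytes(n_params, bits):
    """Bytes needed to store n_params values bit-packed at `bits` bits each."""
    return (n_params * bits + 7) // 8
```

Under these assumptions, one million weights take 625,000 bytes at 5 bits and 750,000 bytes at 6 bits, before any scales, metadata, or compression, which is the kind of trade-off a mixed int5/int6 scheme exploits against a fixed artifact budget.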