val_bpb
0.1434
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.4 MB
Training Techniques
Quantization
GPTQ
bits: 5
scope: all
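As a rough illustration of what a 5-bit export grid looks like, the sketch below applies plain symmetric round-to-nearest quantization to a list of weights. It is not GPTQ: GPTQ additionally corrects rounding error using second-order information, which is omitted here. The per-tensor scale and the function names `quantize_int5` and `dequantize` are assumptions made for the example.

```python
def quantize_int5(weights):
    """Round-to-nearest onto a signed 5-bit grid (illustrative, not GPTQ)."""
    qmin, qmax = -16, 15  # signed 5-bit integers span [-16, 15]
    max_abs = max(abs(w) for w in weights)
    # Assumed: one symmetric scale per tensor, chosen so max |w| maps to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


codes, scale = quantize_int5([-7.5, 0.0, 2.0, 7.5])
restored = dequantize(codes, scale)
```

Each weight is then stored as one of 32 integer codes plus a shared scale, which is where most of the size reduction comes from.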
Architecture
LeakyReLU(0.9)^2
Uses a LeakyReLU squared activation variant in the transformer.
parameters: {"slope":0.9}
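One plausible reading of `LeakyReLU(0.9)^2` is a leaky ReLU with negative slope 0.9 followed by an elementwise square. The scalar sketch below assumes that reading; the function name `leaky_relu_squared` is made up for the example.

```python
def leaky_relu_squared(x: float, slope: float = 0.9) -> float:
    """Apply leaky ReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y


positive = leaky_relu_squared(2.0)   # 2.0 squared
negative = leaky_relu_squared(-1.0)  # (0.9 * -1.0) squared
```

Under this reading the output is non-negative for every input, and negative inputs are scaled by slope squared (0.81) relative to a plain square.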
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_optimizer":"AdamW"}
Weight Averaging
EMA
parameters: {"decays":[0.995,0.996,0.997]}
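A minimal sketch of tracking one EMA copy of the weights per decay value follows. The summary does not say how the three averages are used afterwards (for example, whether one is selected on validation loss), so the code only maintains them. The names `ema_update` and `track_emas` are hypothetical, and parameters are modelled as flat lists of floats.

```python
DECAYS = [0.995, 0.996, 0.997]


def ema_update(shadow, params, decay):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


def track_emas(param_history, decays=DECAYS):
    """Keep a separate EMA copy per decay over a sequence of parameter snapshots."""
    # Assumed: every EMA copy starts from the first snapshot.
    shadows = {d: list(param_history[0]) for d in decays}
    for params in param_history[1:]:
        for d in decays:
            shadows[d] = ema_update(shadows[d], params, d)
    return shadows


shadows = track_emas([[0.0], [1.0], [1.0]])
```

A larger decay averages over a longer window of training steps, so the 0.997 copy moves more slowly than the 0.995 copy.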
Evaluation
two-pass n-gram rescoring
parameters: {"rescore_chunks":15,"cold_cache_rescoring":true}
Test-Time Training
score-first TTT
parameters: {"optimizer":"AdamW","temperature":0.98,"chunk_size":2048}
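The ordering implied by "score-first" can be sketched as a loop that scores each chunk before adapting on it, so no token is scored by weights that have already trained on it. The sketch assumes the temperature divides the logits before the softmax; `score_fn` and `adapt_fn` are hypothetical callbacks standing in for the model's loss evaluation and its AdamW update.

```python
import math


def log_softmax_with_temperature(logits, temperature=0.98):
    """Log-probabilities after dividing logits by the temperature (assumed usage)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def score_first_ttt(tokens, score_fn, adapt_fn, chunk_size=2048):
    """Score each chunk first, then adapt on it; return the summed score."""
    total = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total += score_fn(chunk)  # evaluated before this chunk is trained on
        adapt_fn(chunk)           # hypothetical: gradient step(s) on the scored chunk
    return total
```

A temperature slightly below 1 sharpens the predicted distribution a little, which helps only if the adapted model is otherwise underconfident.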
Other
other
Entropy-adaptive order-2-to-9 n-gram backoff with 4M hash buckets.
parameters: {"order_range":"2-9","hash_buckets":4000000}
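The sketch below shows one way a hashed n-gram cache with order 2 to 9 backoff could work: counts are stored in hashed slots, and lookup falls back from the longest context to shorter ones until a previously seen context is found. The summary does not define "entropy-adaptive", so `mix` assumes the n-gram estimate gets more weight when the model's predictive entropy is high. The class name, the hashing scheme, and `max_weight` are all assumptions.

```python
from collections import defaultdict


class HashedNgramCache:
    def __init__(self, min_order=2, max_order=9, buckets=4_000_000):
        self.min_order, self.max_order, self.buckets = min_order, max_order, buckets
        self.ctx_counts = defaultdict(int)   # slot of (order, context) -> count
        self.pair_counts = defaultdict(int)  # slot of (order, context, next) -> count

    def _slot(self, *key):
        # Hash collisions are tolerated, as in any fixed-size hashed table.
        return hash(key) % self.buckets

    def update(self, tokens):
        """Add every n-gram of order min_order..max_order found in tokens."""
        for i in range(len(tokens)):
            for order in range(self.min_order, self.max_order + 1):
                n_ctx = order - 1
                if i < n_ctx:
                    break  # not enough history for this or any higher order
                ctx = tuple(tokens[i - n_ctx:i])
                self.ctx_counts[self._slot(order, ctx)] += 1
                self.pair_counts[self._slot(order, ctx, tokens[i])] += 1

    def prob(self, context, token):
        """Back off from the longest order; return (probability, order) or (None, 0)."""
        for order in range(self.max_order, self.min_order - 1, -1):
            n_ctx = order - 1
            if len(context) < n_ctx:
                continue
            ctx = tuple(context[-n_ctx:])
            seen = self.ctx_counts.get(self._slot(order, ctx), 0)
            if seen:
                hits = self.pair_counts.get(self._slot(order, ctx, token), 0)
                return hits / seen, order
        return None, 0


def mix(model_p, ngram_p, model_entropy, max_entropy, max_weight=0.5):
    """Assumed rule: trust the n-gram estimate more when the model is uncertain."""
    w = max_weight * min(1.0, model_entropy / max_entropy)
    return (1.0 - w) * model_p + w * ngram_p
```

With 4M buckets shared across eight orders, memory stays bounded regardless of how many distinct n-grams the validation stream contains, at the cost of occasional collisions.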
Novel Contributions
- Two-pass n-gram evaluation that rescores early chunks with the complete cache
- Cold-cache penalty reduction for early validation chunks
- Backward-looking compliant rescoring of tokens already evaluated in pass 1
- Combination of score-first TTT, GPTQ-Int5 export, and n-gram rescoring in a single pipeline
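
The two-pass evaluation described in the list above can be sketched with a toy cache. Here an add-one-smoothed unigram counter stands in for the n-gram cache, which is a simplification made for the example; `chunk_nll` and `two_pass_eval` are hypothetical names. Pass 1 scores each chunk against the cache built from earlier chunks, and pass 2 rescores the first `rescore_chunks` chunks against the complete cache.

```python
import math
from collections import Counter


def chunk_nll(chunk, counts, vocab_size):
    """Total negative log-likelihood of a chunk under add-one-smoothed counts."""
    total = sum(counts.values())
    return sum(-math.log((counts[t] + 1) / (total + vocab_size)) for t in chunk)


def two_pass_eval(chunks, vocab_size, rescore_chunks=15):
    """Return per-chunk NLL after a cold-start pass and a rescoring pass."""
    cache = Counter()
    nll = []
    # Pass 1: score with the cache seen so far, then add the chunk to the cache.
    for chunk in chunks:
        nll.append(chunk_nll(chunk, cache, vocab_size))
        cache.update(chunk)
    # Pass 2: the earliest chunks were scored against a nearly empty cache,
    # so rescore them against the complete cache.
    for i in range(min(rescore_chunks, len(chunks))):
        nll[i] = chunk_nll(chunks[i], cache, vocab_size)
    return nll


scores = two_pass_eval([[0, 0], [0, 0]], vocab_size=2, rescore_chunks=1)
```

In this toy run the first chunk's loss drops from 2·ln 2 to 2·ln(6/5) after rescoring, which is the cold-cache penalty the second pass is meant to reduce.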