val_bpb: 1.0786
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,979,228 bytes

Training Techniques
Quantization: mixed int6/int8 via GPTQ (scope: all; bits: null)
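
A minimal sketch of the mixed-width quantization step is below. It is not GPTQ itself (GPTQ compensates quantization error with second-order statistics); it only shows plain round-to-nearest quantization with a per-tensor choice of int6 or int8. The width rule and all helper names are assumptions for illustration.

```python
# Minimal sketch of mixed int6/int8 round-to-nearest quantization.
# NOT GPTQ: no Hessian-based error compensation, only the mixed-width packing.
# The per-tensor width rule (int8 for embeddings/output, int6 elsewhere) is an
# assumption, not taken from the submission.
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-channel round-to-nearest quantization of a 2D weight."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

def quantize_state_dict(state_dict, int8_keys=("embed", "lm_head")):
    """Quantize every 2D weight; named tensors stay int8, the rest go to int6."""
    out = {}
    for name, w in state_dict.items():
        if w.ndim != 2:
            out[name] = w                            # leave norms/biases in full precision
            continue
        bits = 8 if any(k in name for k in int8_keys) else 6
        q, scale = quantize_symmetric(w, bits)
        out[name] = {"q": q, "scale": scale, "bits": bits}
    return out
```
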
Architecture: depth recurrence (layers: 3)
Uses 3-layer depth recurrence in the model stack.
Architecture: GQA (query_heads: 8, kv_heads: 4)
Uses 8 query heads and 4 KV heads.
Weight Averaging: EMA (decay: 0.9965)
Optimizer: Muon (weight_decay: 0.095, momentum: 0.9, mlp_weight_decay: 0.115)
Compression: Brotli (level: null)
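
A minimal sketch of Brotli-compressing the serialized artifact. The reported level is null, so `quality=11` (Brotli's maximum) is an assumption, and the pickle serialization is illustrative rather than the submission's actual container format.

```python
# Minimal sketch of Brotli compression of the quantized artifact.
import pickle
import brotli   # pip install brotli

def compress_artifact(quantized_state: dict, path: str, quality: int = 11) -> int:
    raw = pickle.dumps(quantized_state, protocol=pickle.HIGHEST_PROTOCOL)
    blob = brotli.compress(raw, quality=quality)      # quality=11 is an assumption
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)            # compare against the 15,979,228-byte artifact size

def load_artifact(path: str) -> dict:
    with open(path, "rb") as f:
        return pickle.loads(brotli.decompress(f.read()))
```
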
Test-Time Training: score-first TTT (enabled: 1, learning_rate: 0.005, epochs: 4, momentum: 0.9, chunk_tokens: 32768)
LR Schedule: warmdown (warmdown_frac: 0.72)
Regularization: weight decay (muon_wd: 0.095, muon_wd_mlp: 0.115)
Other
Uses a legal score-first TTT setup with a wallclock reserve and a rerun plan for valid logs (max_wallclock_seconds: 590, gptq_reserve_seconds: 0.5).
Novel Contributions
- SP8192 cached data and tokenizer
- 11-layer 512d Transformer with 8 query heads and 4 KV heads
- 3-layer depth recurrence
- parallel residual lanes
- QK gain 5.25
- Muon-style optimizer with split weight decay
- EMA weight averaging (decay 0.9965)
- GPTQ int6/int8 compression plus Brotli
- legal score-first TTT
- wallclock-reserved rerun plan