| Field | Value |
| --- | --- |
| val_bpb | 1.0805 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact Size | 16 MB |
## Training Techniques
### Architecture

**Depth recurrence.** Layers 3 through 5 are executed twice per forward pass, increasing effective depth without increasing parameter count.

parameters: `{"layers": [3, 4, 5], "repeats": 2}`
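A minimal PyTorch sketch of this kind of weight-tied depth recurrence, assuming the layer-3-to-5 span repeats as a block (the entry does not say whether the span loops as a unit or each layer repeats individually); class and argument names are illustrative:

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Runs a contiguous span of layers multiple times with shared weights,
    increasing effective depth at zero parameter cost."""

    def __init__(self, layers, recur_start=3, recur_end=5, repeats=2):
        super().__init__()
        layers = list(layers)
        self.pre = nn.ModuleList(layers[:recur_start])
        self.recur = nn.ModuleList(layers[recur_start:recur_end + 1])
        self.post = nn.ModuleList(layers[recur_end + 1:])
        self.repeats = repeats

    def forward(self, x):
        for layer in self.pre:
            x = layer(x)
        for _ in range(self.repeats):  # the span runs `repeats` times, weights reused
            for layer in self.recur:
                x = layer(x)
        for layer in self.post:
            x = layer(x)
        return x
```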
**Parallel residuals.** Attention and MLP are processed in parallel to widen the model within the same latency budget.

parameters: none reported
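A sketch of the standard (GPT-J/PaLM-style) parallel-residual block that this description matches; since no parameters are reported, the single shared pre-norm and the layer sizes below are assumptions:

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP branches read the same normalized input and their
    outputs are summed into one residual update, instead of running
    sequentially; the two branches can therefore execute concurrently."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)  # one shared pre-norm feeds both branches
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)
```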
### Optimizer

**Muon.** Weight decay, momentum, and other hyperparameters are not reported.
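A condensed sketch of the publicly documented Muon update (SGD with momentum, where each 2-D weight's update is orthogonalized by a quintic Newton-Schulz iteration before being applied). The entry reports no hyperparameters, so the learning rate and momentum below are placeholders:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix using the
    quintic Newton-Schulz iteration from the Muon reference code."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the reference implementation
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)           # spectral-norm proxy so the iteration converges
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(params, momentum_buffers, lr=0.02, momentum=0.95):
    """One Muon step: momentum accumulation, then an orthogonalized update."""
    for p, buf in zip(params, momentum_buffers):
        buf.mul_(momentum).add_(p.grad)
        p.add_(newton_schulz(buf), alpha=-lr)
```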
### Test-Time Training

**Score-first TTT** (no parameters reported).
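The entry does not define "score-first TTT". One natural reading, sketched here under that assumption, is a prequential loop: each evaluation chunk is scored under the current weights *before* the model takes a gradient step on it, so adaptation helps later chunks while every bpb measurement stays causal. All names and the loop structure are illustrative:

```python
import torch

def score_first_ttt(model, optimizer, chunks, loss_fn):
    """Score each chunk with the current weights, then train on it."""
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score before any update touches this chunk.
        model.eval()
        with torch.no_grad():
            total_loss += loss_fn(model(inputs), targets).item() * targets.numel()
            total_tokens += targets.numel()
        # 2) Then adapt on the same chunk, benefiting later chunks only.
        model.train()
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    return total_loss / total_tokens  # average per-token loss over the stream
```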
### Quantization

**GPTQ.** Applied to all weights; bit width not reported.
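A stripped-down sketch of GPTQ's core loop: quantize one input column at a time and push each column's rounding error onto the not-yet-quantized columns via the inverse Hessian of the layer's inputs. The entry leaves the bit width unreported, so the 4-bit width and single per-tensor scale here are placeholders; production GPTQ uses per-group scales and blocked updates:

```python
import torch

def gptq_quantize_layer(W: torch.Tensor, H: torch.Tensor, bits=4, damp=0.01):
    """W: (out_features, in_features) weights.
    H: (in_features, in_features) Hessian proxy, e.g. 2 * X @ X.T over calibration data."""
    W = W.clone()
    n = W.size(1)
    # Dampen the Hessian for stability, then form the upper Cholesky factor
    # of its inverse, as in the GPTQ reference implementation.
    H = H + damp * H.diagonal().mean() * torch.eye(n, device=W.device)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)

    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().max() / qmax           # per-tensor symmetric scale, for brevity
    for i in range(n):
        q = torch.clamp(torch.round(W[:, i] / scale), -qmax - 1, qmax) * scale
        err = (W[:, i] - q) / U[i, i]
        W[:, i] = q
        if i + 1 < n:
            # Redistribute the rounding error onto later columns.
            W[:, i + 1:] -= torch.outer(err, U[i, i + 1:])
    return W
```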
### Compression

**Brotli.** Compression level not reported.
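Packing the quantized state dict with Brotli is straightforward with the standard `brotli` bindings; the level is not reported, so `quality=11` (maximum) below is an assumption that trades packing speed for artifact size:

```python
import io
import brotli  # pip install Brotli
import torch

def save_compressed(model, path: str, quality: int = 11):
    """Serialize the state dict and Brotli-compress it into one artifact."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue(), quality=quality))

def load_compressed(model, path: str):
    """Decompress the artifact and load the state dict back."""
    with open(path, "rb") as f:
        blob = brotli.decompress(f.read())
    model.load_state_dict(torch.load(io.BytesIO(blob)))
    return model
```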
## Novel Contributions
- SP8192 tokenizer for improved compression on FineWeb
- Depth recurrence on layers 3-5
- Parallel residual attention/MLP processing
- Muon optimizer
- Score-first test-time training
- GPTQ with SDClip
- Brotli-compressed state dictionary