val_bpb: 1.2185
Architecture: Transformer
Optimizer: —
Artifact Size: ~14.4 MB
Training Techniques
Quantization
STE QAT
Quantization-aware training with a straight-through estimator (STE); weights are quantized in the forward pass.
parameters: {"bits":2,"scope":"all weights"}
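A minimal sketch of ternary QAT with a straight-through estimator in PyTorch; the per-tensor mean-|w| scaling (BitNet b1.58 style) is an assumption, since the submission's exact quantizer is not shown:

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Quantize weights to {-1, 0, +1} on the forward pass; pass
    gradients straight through on the backward pass."""

    @staticmethod
    def forward(ctx, w):
        # Per-tensor scale; mean-|w| scaling is an assumption (BitNet b1.58 style).
        scale = w.abs().mean().clamp(min=1e-8)
        q = torch.round(w / scale).clamp(-1, 1)  # nearest value in {-1, 0, +1}
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat the quantizer as the identity.
        return grad_out

# During training, use TernarySTE.apply(self.weight) wherever the raw
# weight would be used, e.g. F.linear(x, TernarySTE.apply(self.weight)).
```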
Architecture
KV head count
Uses 8 query heads with 4 key/value heads (grouped-query attention).
parameters: {"heads":8,"kv_heads":4}
MLP3x
Uses a 3x MLP expansion.
parameters: {"multiplier":3}
XSA
Applies XSA in the last 4 layers.
parameters: {"layers":4}
Weight Averaging
EMA (exponential moving average of weights)
parameters: null
SWA (stochastic weight averaging)
parameters: null
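For reference, a minimal EMA weight-averaging update (the 0.999 decay is an assumption; SWA instead averages checkpoints uniformly over a training window). Note the contributions below report EMA was found incompatible with ternary quantization:

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """ema <- decay * ema + (1 - decay) * live weights, applied in place."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.lerp_(p, 1.0 - decay)  # linear interpolation toward the live weight
```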
Other
Ternary packing
Stores BitNet-style ternary weights with base-3 encoding, packing five values per byte.
parameters: {"values_per_byte":5}
Novel Contributions
- Introduces BitNet-style ternary quantization {-1, 0, +1} for the challenge submission.
- Demonstrates that ternary quantization fits roughly 2x more parameters into the same size budget (see the rough capacity check after this list).
- Finds that EMA is incompatible with ternary quantization and should be disabled.
- Uses base-3 packing to store five ternary values per byte.
- Reports a ternary QAT implementation with STE-based training.
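A rough capacity check (an estimate, assuming the ~14.4 MB artifact is dominated by packed weights): five trits per byte is 8/5 = 1.6 bits per weight, so the budget holds on the order of 14.4 × 10^6 bytes × 5 ≈ 72M ternary parameters.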