PR #760

closed

Add BitNet b1.58 Ternary Quantization (non-record submission)

val_bpb: 1.2185
Architecture: Transformer
Optimizer:
Artifact Size: ~14.4 MB

Training Techniques

Quantization
  • STE QAT
    parameters: {"bits":2,"scope":"all weights"}

Architecture
  • KV head count: uses 8 attention heads with 4 KV heads.
    parameters: {"heads":8,"kv_heads":4}
  • MLP3x: uses a 3x MLP expansion.
    parameters: {"multiplier":3}
  • XSA: applies XSA in the last 4 layers.
    parameters: {"layers":4}

Weight Averaging
  • EMA
    parameters: null
  • SWA
    parameters: null

Other
  • Uses BitNet-style ternary packing with base-3 encoding, packing 5 ternary values per byte.
    parameters: {"values_per_byte":5}
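The base-3 packing above works because five ternary digits span 3**5 = 243 states, which fits in one byte (256 states). A minimal sketch of this scheme, with illustrative function names not taken from the PR:

```python
def pack_ternary(values):
    """Pack ternary values (-1/0/+1) into bytes, 5 per byte, via base-3."""
    assert len(values) % 5 == 0, "pad to a multiple of 5 before packing"
    out = bytearray()
    for i in range(0, len(values), 5):
        byte = 0
        # Map -1/0/+1 -> 0/1/2 and accumulate as a 5-digit base-3 number.
        for v in values[i:i + 5]:
            byte = byte * 3 + (v + 1)
        out.append(byte)  # always in [0, 242], so it fits in one byte
    return bytes(out)

def unpack_ternary(data):
    """Inverse of pack_ternary: recover 5 ternary values from each byte."""
    values = []
    for byte in data:
        digits = []
        for _ in range(5):
            digits.append(byte % 3 - 1)  # base-3 digit back to -1/0/+1
            byte //= 3
        values.extend(reversed(digits))  # digits come out least-significant first
    return values
```

At 5 values per byte this is 1.6 bits per weight, close to the information-theoretic 1.58 bits of a ternary symbol; naive 2-bit packing would waste one of the four states.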

Novel Contributions

  • Introduces BitNet-style ternary quantization {-1, 0, +1} for the challenge submission.
  • Demonstrates that ternary quantization allows roughly 2x more parameters within the same size budget.
  • Finds that EMA is incompatible with ternary quantization and should be disabled.
  • Uses base-3 packing to store five ternary values per byte.
  • Provides a ternary QAT implementation trained with a straight-through estimator (STE).
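The quantization step behind the STE-based QAT above can be sketched as follows. This is an assumption-laden illustration: the absmean scale and function names follow the BitNet b1.58 paper's recipe, not the PR's actual diff.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize weights to scale * {-1, 0, +1} using an absmean scale
    (BitNet b1.58-style; assumed here, not confirmed by the PR)."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1.0, 1.0)  # ternary codes
    return q * scale, q

def ste_grad(grad_out):
    """Straight-through estimator: the round/clip is treated as identity
    in the backward pass, so gradients flow to the full-precision weights."""
    return grad_out
```

In training, the forward pass uses the quantized weights while the optimizer updates the full-precision master copy via the STE; at export time only the ternary codes (plus the per-tensor scale) need to be stored, which is what makes the base-3 packing pay off.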