val_bpb: 1.2185
Architecture: Transformer
Optimizer: —
Artifact Size: ~14.4 MB
Training Techniques
Quantization
STE QAT
Quantization-aware training with a straight-through estimator (STE); weights are quantized in the forward pass.
parameters: {"bits":2,"scope":"all weights"}
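A minimal sketch of ternary QAT with a straight-through estimator in PyTorch; the per-tensor mean-|w| scaling (BitNet b1.58 style) is an assumption, since the submission's exact quantizer is not shown:

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Quantize weights to {-1, 0, +1} on the forward pass; pass
    gradients straight through on the backward pass."""

    @staticmethod
    def forward(ctx, w):
        # Per-tensor scale; mean-|w| scaling is an assumption (BitNet b1.58 style).
        scale = w.abs().mean().clamp(min=1e-8)
        q = torch.round(w / scale).clamp(-1, 1)  # nearest value in {-1, 0, +1}
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat the quantizer as the identity.
        return grad_out

# During training, use TernarySTE.apply(self.weight) wherever the raw
# weight would be used, e.g. F.linear(x, TernarySTE.apply(self.weight)).
```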
Architecture
KV head count
Uses 8 query heads with 4 key/value heads (grouped-query attention).
parameters: {"heads":8,"kv_heads":4}
MLP3x
Uses a 3x MLP expansion.
parameters: {"multiplier":3}
XSA
Applies XSA in the last 4 layers.
parameters: {"layers":4}
Weight Averaging
EMA (exponential moving average of weights)
parameters: null
SWA (stochastic weight averaging)
parameters: null
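For reference, a minimal EMA weight-averaging update (the 0.999 decay is an assumption; SWA instead averages checkpoints uniformly over a training window). Note the contributions below report EMA was found incompatible with ternary quantization:

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """ema <- decay * ema + (1 - decay) * live weights, applied in place."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.lerp_(p, 1.0 - decay)  # linear interpolation toward the live weight
```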
Other
Ternary packing
Stores BitNet-style ternary weights with base-3 encoding, packing five values per byte.
parameters: {"values_per_byte":5}
Novel Contributions
- Introduces BitNet-style ternary quantization {-1, 0, +1} for the challenge submission.
- Demonstrates that ternary quantization fits roughly 2x more parameters into the same size budget (see the rough capacity check after this list).
- Finds that EMA is incompatible with ternary quantization and should be disabled.
- Uses base-3 packing to store five ternary values per byte.
- Reports a ternary QAT implementation with STE-based training.
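A rough capacity check (an estimate, assuming the ~14.4 MB artifact is dominated by packed weights): five trits per byte is 8/5 = 1.6 bits per weight, so the budget holds on the order of 14.4 × 10^6 bytes × 5 ≈ 72M ternary parameters.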