PR #139
Non-record: BitNet b1.58 — 65M ternary params beat 4-hour baseline in 10 minutes (val_bpb=1.2029)
by ksang123
val_bpb: 1.2029
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.11 MB
Training Techniques
Quantization
STE QAT
bits: 2
scope: all linear layers (attention and MLP); ternary {-1, 0, 1} weights
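The quantization-aware training described above (ternary weights, per-group absmean scaling, straight-through-estimator gradients) can be sketched as a PyTorch layer. This is a hypothetical reimplementation based on the technique list, not the PR's actual code; the class and method names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a BitNet b1.58-style linear layer: ternary {-1, 0, 1}
    weights with per-group absmean scaling and STE gradients."""

    def __init__(self, in_features, out_features, bias=False, group_size=64):
        super().__init__(in_features, out_features, bias=bias)
        self.group_size = group_size

    def quantize(self, w):
        g = w.reshape(-1, self.group_size)                # per-group view
        scale = g.abs().mean(dim=1, keepdim=True).clamp(min=1e-8)
        q = (g / scale).round().clamp(-1, 1)              # ternary trits
        return (q * scale).reshape_as(w)

    def forward(self, x):
        w = self.weight
        # STE: forward sees quantized weights; backward passes gradients
        # straight through to the full-precision master weights.
        w_q = w + (self.quantize(w) - w).detach()
        return F.linear(x, w_q, self.bias)
```

Because quantization runs in every forward pass, the trained network is exactly the network that gets stored, which is the source of the near-zero quantization gap claimed below.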
Architecture
BitLinear
All linear layers use ternary weight quantization with per-group scaling and STE gradients.
parameters: {"group_size":64}
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":12,"kv_heads":6}
MLP3x
Expanded MLP with 3x hidden dimension.
parameters: {"mlp_multiplier":3,"hidden_dim":2304}
RoPE
Rotary positional embeddings with a larger base.
parameters: {"base":200000}
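The architectural choices listed above can be collected into a single config sketch. The values come from the technique list; `d_model = 768` is an assumption inferred from `hidden_dim = 3 * d_model = 2304`, and the class name is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # d_model is inferred from hidden_dim = 3 * d_model = 2304 (assumption).
    d_model: int = 768
    n_heads: int = 12            # query heads
    n_kv_heads: int = 6          # grouped-query attention: 2 query heads per KV head
    mlp_hidden: int = 2304       # 3x expanded MLP
    rope_base: float = 200_000.0 # larger-than-default RoPE base
    tie_embeddings: bool = True  # input and output embeddings share weights
```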
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"tied_embedding_lr":0.03,"warmup_momentum_start":0.92,"warmup_steps":1500}
LR Schedule
linear warmup + wallclock-aware linear warmdown
parameters: {"warmup_steps":50,"warmdown_steps":1200}
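A minimal sketch of this schedule, assuming the wallclock-aware warmdown can be approximated by a fixed `total_steps` (in the PR the warmdown trigger depends on remaining wallclock time; the function name is hypothetical):

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 50, warmdown_steps: int = 1200) -> float:
    """Multiplier on the base LR: linear warmup, flat middle,
    linear warmdown over the final warmdown_steps steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup
    if step >= total_steps - warmdown_steps:
        remaining = total_steps - step
        return max(remaining / warmdown_steps, 0.0)  # linear warmdown
    return 1.0                                    # constant plateau
```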
Sequence Length
train_length: 2048
eval_length: 2048
Compression
lzma
level: null
Other
other
fp16 scale simulation during training using .half().float() to match stored scale precision and reduce the quantization gap.
parameters: null
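The fp16 scale simulation amounts to round-tripping the per-group scale through half precision during training, so the forward pass sees exactly the scale value that will later be stored in the artifact. A minimal sketch (the function name and group layout are assumptions):

```python
import torch

def fp16_scale(w: torch.Tensor, group_size: int = 64) -> torch.Tensor:
    """Per-group absmean scale, round-tripped through fp16 so training
    matches the precision of the scales stored in the artifact."""
    g = w.reshape(-1, group_size)
    scale = g.abs().mean(dim=1, keepdim=True)
    return scale.half().float()  # simulate fp16 storage precision
```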
other
Base-3 packing of ternary weights with 5 trits per byte for lossless artifact storage.
parameters: {"trits_per_byte":5}
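Base-3 packing works because 3^5 = 243 fits in one byte, so 5 ternary values cost 8 bits (1.6 bits/weight) with no loss. A sketch of the encode/decode pair, assuming trits arrive as a flat sequence of {-1, 0, 1} (function names are hypothetical):

```python
def pack_trits(trits) -> bytes:
    """Pack ternary values {-1, 0, 1} into bytes, 5 trits per byte."""
    trits = list(trits)
    while len(trits) % 5:
        trits.append(0)                  # zero-pad to a multiple of 5
    out = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in trits[i:i + 5]:
            b = b * 3 + (t + 1)          # map {-1, 0, 1} -> base-3 digit {0, 1, 2}
        out.append(b)                    # 3**5 = 243 <= 256, fits one byte
    return bytes(out)

def unpack_trits(data: bytes, n: int):
    """Recover the first n trits from packed bytes (lossless)."""
    trits = []
    for b in data:
        group = []
        for _ in range(5):
            group.append(b % 3 - 1)      # peel base-3 digits, least first
            b //= 3
        trits.extend(reversed(group))
    return trits[:n]
```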
Novel Contributions
- Uses BitNet b1.58 ternary weights to fit 64.5M parameters into a 15.1MB artifact.
- Achieves near-zero quantization gap by training with ternary quantization active in every forward pass.
- Uses fp16 scale simulation (.half().float()) so training matches stored scale precision.
- Applies base-3 packing (5 trits per byte) for lossless, compact artifact storage.
- Demonstrates that a 10-minute ternary model can beat a 4-hour full-precision baseline under the same size budget.
- Argues that Chinchilla scaling under a fixed artifact-size constraint favors more low-precision parameters over fewer high-precision parameters.
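The size argument can be sanity-checked with back-of-envelope arithmetic. This assumes an artifact layout of packed ternary weights plus one fp16 scale per 64-weight group; the PR does not spell out the exact layout, so treat the numbers as illustrative.

```python
# Assumed layout: ternary weights at 5 trits/byte + fp16 scale per 64-weight group.
params = 64.5e6
weight_bytes = params / 5            # 5 trits per byte
scale_bytes = params / 64 * 2        # one 2-byte fp16 scale per group of 64
total_mb = (weight_bytes + scale_bytes) / 1e6
print(total_mb)                      # ~14.9 MB, close to the 15.11 MB artifact

# The same 15.11 MB budget spent on fp16 weights instead:
fp16_params = 15.11e6 / 2            # ~7.6M parameters, ~8x fewer than ternary
```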