PR #367
Non-record: BitNet b1.58 - 68M ternary params, val_bpb=1.1770, systematic analysis of ternary limitations
by ksang123
val_bpb
1.1770
Architecture
Transformer
Optimizer
—
Artifact Size
15.88MB
Training Techniques
Quantization
ternary QAT
bits: 2
scope: all projections
Architecture
BitLinear
Ternary {-1, 0, 1} linear layers used for all attention and MLP projections with per-group absmax STE.
parameters: null
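The BitLinear entry above can be sketched as follows. This is a minimal illustration, not the PR's code: `group_size=64` and the exact rounding rule are assumptions, and the training-time straight-through estimator (STE) is only described in the comments.

```python
def ternary_quantize(weights, group_size=64):
    # Per-group absmax ternary quantization: each group is scaled by its
    # max |w|, then rounded to {-1, 0, 1}. group_size=64 is an assumed
    # setting. During training an STE is applied on top of this: the forward
    # pass uses the quantized weights, while the backward pass propagates
    # gradients as if quantization were the identity.
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) or 1.0  # absmax scale per group
        codes.extend(max(-1, min(1, round(w / scale))) for w in group)
        scales.append(scale)
    return codes, scales

def ternary_dequantize(codes, scales, group_size=64):
    # Reconstruct approximate fp weights: code * its group's scale.
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```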
MLP 3.25x
Widened MLP from 3x to 3.25x to add parameters at low artifact cost.
parameters: {"hidden":2496}
GQA
Grouped-query attention with 6 KV heads.
parameters: {"heads":12,"kv_heads":6}
U-Net skip connections
Added U-Net-style skip connections across the transformer layer stack.
parameters: null
tied embeddings
Tied the input embedding and output head weights, stored in fp16.
parameters: null
RoPE
Rotary positional embeddings with a large base.
parameters: {"base":200000}
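The RoPE entry's large base can be illustrated with the standard inverse-frequency schedule; `head_dim=64` in the usage below is an assumption, while `base=200000` is the PR's setting.

```python
def rope_inv_freq(head_dim, base=200000.0):
    # Inverse rotation frequency for each even/odd dimension pair in rotary
    # embeddings. A larger base slows the rotation of the high-index pairs,
    # stretching positional resolution over longer distances.
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]
```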
logit softcap
Applied a tanh softcap to the output logits.
parameters: {"value":30}
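The logit softcap with value 30 is the usual tanh bound; a minimal sketch:

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds a logit to (-cap, cap) via cap * tanh(logit / cap).
    # Near zero this is approximately the identity; cap=30 matches the PR.
    return cap * math.tanh(logit / cap)
```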
LR Schedule
warmdown
parameters: {"longer_warmdown":true}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Weight Averaging
EMA/SWA
parameters: null
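The EMA side of the weight-averaging entry can be sketched as a one-line update; `decay=0.999` is an assumed value, since the PR does not list its EMA/SWA settings.

```python
def ema_update(avg_params, params, decay=0.999):
    # Exponential moving average of model weights, applied once per step:
    # avg <- decay * avg + (1 - decay) * current. decay=0.999 is assumed.
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, params)]
```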
Initialization
OrthoInit
Orthogonal initialization used in some ablations; found to have no effect for ternary models.
Test-Time Training
TTT
parameters: {"learning_rate":0.002}
Evaluation
sliding window eval
parameters: {"stride":64}
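The sliding-window evaluation with stride 64 can be sketched as a window planner: each window scores only the tokens not yet covered, so every token is evaluated exactly once with as much left context as fits. `stride=64` matches the PR; `context=512` is an assumed context length.

```python
def sliding_eval_windows(n_tokens, context=512, stride=64):
    # Returns (start, stop, n_scored) triples. The first window scores its
    # full span; each later window advances by `stride` and scores only the
    # newly covered tokens, reusing the rest as context.
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```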
Compression
lzma
level: null
Other
other
Base-3 packing of ternary weights at 1.6 bits/parameter.
parameters: {"bits_per_param":1.6}
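The 1.6 bits/parameter figure comes from packing five trits per byte (3^5 = 243 fits in 256, so 8/5 = 1.6 bits each); at 68M parameters that is roughly 68e6/5 ≈ 13.6 MB of packed weights, leaving room in the 15.88 MB artifact for fp16 embeddings and scales. A minimal sketch of the packing, not the PR's exact serialization code:

```python
def pack_ternary(codes):
    # Map {-1, 0, 1} -> {0, 1, 2} and pack five trits per byte as a base-3
    # number: max value 2 * (1 + 3 + 9 + 27 + 81) = 242 < 256.
    trits = [c + 1 for c in codes]
    trits += [0] * (-len(trits) % 5)  # pad to a multiple of 5
    packed = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in reversed(trits[i:i + 5]):
            b = b * 3 + t
        packed.append(b)
    return bytes(packed)

def unpack_ternary(packed, n):
    # Invert the packing: peel off base-3 digits, map {0, 1, 2} -> {-1, 0, 1}.
    out = []
    for b in packed:
        for _ in range(5):
            out.append(b % 3 - 1)
            b //= 3
    return out[:n]
```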
other
FP16 scale simulation during training to match serialization precision and reduce the roundtrip gap.
parameters: {"roundtrip_gap":0.0016}
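The fp16 scale simulation amounts to rounding each quantization scale through half precision during training, so train-time dequantization matches what the serialized artifact will produce. A stdlib-only sketch of that rounding, using `struct`'s half-float format:

```python
import struct

def fp16_round(x):
    # Round a Python float through IEEE 754 half precision ('e' format),
    # mimicking how a scale value is serialized. Training against this
    # rounded scale moves the fp16 error into training instead of leaving
    # it as a quantization roundtrip gap at export.
    return struct.unpack('e', struct.pack('e', x))[0]
```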
Novel Contributions
- Systematic negative-results analysis of techniques that break or do not help ternary models
- Near-lossless ternary quantization roundtrip via fp16 scale simulation during training
- Demonstrated that ternary models prefer a higher learning rate, no regularization, and a longer warmdown
- Showed that base-3 packing can store 68M ternary parameters in 15.88MB
- Suggested int4 with late QAT as an unexplored middle ground