PR #666

open

Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB)

by chrislovescoding
val_bpb
1.1932
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,878,267 bytes

Training Techniques

Quantization
ternary
bits: 2
scope: all weights
Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"layers":12,"dimensions":768,"heads":12,"kv_heads":6}
MLP3x
Expanded MLP with 3x hidden size.
parameters: {"hidden":2304}
U-Net skip connections
Adds skip connections in a U-Net-like pattern to the transformer.
parameters: null
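For the KV head count entry above, a minimal shape sketch of grouped-query attention in NumPy (illustrative only; variable names are not from the PR's code):

```python
import numpy as np

# Config from the PR: dim 768, 12 attention heads, 6 KV heads.
dim, n_heads, n_kv_heads, seq = 768, 12, 6, 10
head_dim = dim // n_heads                      # 64

q = np.zeros((1, n_heads, seq, head_dim))      # one query tensor per head
kv = np.zeros((1, n_kv_heads, seq, head_dim))  # half as many K/V tensors

# Each KV head serves n_heads // n_kv_heads = 2 query heads: repeat it
# along the head axis before the usual attention matmuls.
k = np.repeat(kv, n_heads // n_kv_heads, axis=1)
assert k.shape == q.shape                      # (1, 12, 10, 64)
```

Halving the KV heads halves the K/V projection parameters and the KV cache, which matters under a tight artifact budget.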
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02}
Compression
zlib
level: 9
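Ternary codes compress well under deflate because most entries are zero. A hedged sketch of one possible pack-then-compress path (the {-1, 0, +1} → {0, 1, 2} mapping and the 4-codes-per-byte layout are assumptions for illustration, not the PR's actual serialization format):

```python
import zlib
import numpy as np

def pack_and_compress(codes: np.ndarray) -> bytes:
    """Pack ternary codes {-1, 0, +1} into 2-bit fields, then zlib level 9."""
    vals = (codes.ravel() + 1).astype(np.uint8)      # {-1,0,1} -> {0,1,2}
    pad = (-len(vals)) % 4                           # pad to a byte boundary
    vals = np.concatenate([vals, np.zeros(pad, np.uint8)])
    v = vals.reshape(-1, 4)                          # 4 two-bit codes per byte
    packed = v[:, 0] | (v[:, 1] << 2) | (v[:, 2] << 4) | (v[:, 3] << 6)
    return zlib.compress(packed.astype(np.uint8).tobytes(), level=9)
```

Level 9 trades compression time for the smallest output, which is the right trade for a one-shot artifact.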
Evaluation
sliding window eval
parameters: {"stride":64}
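Sliding-window evaluation with stride 64 scores each token with long left context, keeping only the freshly advanced positions from each window. A generic sketch (the `nll_fn` interface is a stand-in, not the PR's API):

```python
import math

def sliding_window_eval(nll_fn, tokens, window=1024, stride=64):
    """Return one negative log-likelihood (in nats) per token.

    Windows advance by `stride`; only the last `end - pos` scores of each
    window are new, so every kept token sees up to `window` of context.
    """
    nlls = []
    for pos in range(0, len(tokens), stride):
        end = min(pos + stride, len(tokens))
        start = max(0, end - window)
        out = nll_fn(tokens[start:end])    # per-token NLLs for this window
        nlls.extend(out[-(end - pos):])    # keep only newly covered tokens
    return nlls

def bits_per_byte(nlls, n_bytes):
    """Convert summed NLLs (nats) over a text of n_bytes into BPB."""
    return sum(nlls) / (math.log(2) * n_bytes)
```

A smaller stride gives each scored token more context at the cost of more forward passes per token.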
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Ternary activation schedule: full-precision training for the first 30% of wallclock, then ternary STE for the remaining 70%.
parameters: {"ternary_start_frac":0.3}
other
Straight-Through Estimator ternary training with per-row mean-absolute scaling and thresholding to {-1, 0, +1}.
parameters: {"threshold_multiplier":0.7}
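The two items above combine as a per-row absmean quantizer inside a straight-through training loop. A minimal NumPy sketch (the 0.7 threshold and per-row mean-absolute scaling are from the parameters listed; the function name is illustrative):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, threshold_multiplier: float = 0.7):
    """Per-row absmean ternary quantization to {-1, 0, +1}.

    scale  = mean(|row|), one scale per output row;
    entries below threshold_multiplier * scale snap to 0.
    """
    scale = np.mean(np.abs(w), axis=1, keepdims=True)
    codes = np.sign(w) * (np.abs(w) >= threshold_multiplier * scale)
    return codes.astype(np.int8), codes * scale   # codes, dequantized weights

# Straight-through estimator in autograd frameworks (PyTorch-style, shown
# as a comment): forward uses the dequantized weights, backward treats
# quantization as identity:
#   w_ste = w + (dequant - w).detach()
```

Because the forward pass already sees the ternary weights during training, the exported artifact matches what the loss was optimized against.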

Novel Contributions

  • Trains a 65M-parameter model within a 15.9MB artifact budget using ternary weight quantization.
  • Achieves a near-zero quantization gap by keeping ternary STE in the training loop rather than quantizing post hoc.
  • Demonstrates that a much larger model fits in the same artifact budget as smaller int6 submissions.
  • Combines ternary quantization with grouped-query attention, tied embeddings, and U-Net skip connections.
  • Applies a staged training schedule that switches from full precision to ternary STE after 30% of wallclock.
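As a back-of-envelope check on the artifact budget (assuming 2-bit packing before compression; 65M is the rounded parameter count from the title):

```python
params = 65_000_000                    # ~65M weights, all ternary
raw_bytes = params * 2 // 8            # 2 bits/weight -> 16,250,000 bytes
artifact_bytes = 15_878_267            # reported artifact size
assert artifact_bytes < raw_bytes      # zlib level 9 closes the remaining gap
```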