PR #139
Non-record: BitNet b1.58 — 65M ternary params beat 4-hour baseline in 10 minutes (val_bpb=1.2029)
by ksang123
val_bpb: 1.2029
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.11 MB
Training Techniques
Quantization
STE QAT
bits: 2
scope: all linear layers (attention and MLP); ternary {-1, 0, 1} weights
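The quantization-aware training described above (ternary weights, per-group absmean scaling, straight-through-estimator gradients) can be sketched as a PyTorch layer. This is a hypothetical reimplementation based on the technique list, not the PR's actual code; the class and method names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a BitNet b1.58-style linear layer: ternary {-1, 0, 1}
    weights with per-group absmean scaling and STE gradients."""

    def __init__(self, in_features, out_features, bias=False, group_size=64):
        super().__init__(in_features, out_features, bias=bias)
        self.group_size = group_size

    def quantize(self, w):
        g = w.reshape(-1, self.group_size)                # per-group view
        scale = g.abs().mean(dim=1, keepdim=True).clamp(min=1e-8)
        q = (g / scale).round().clamp(-1, 1)              # ternary trits
        return (q * scale).reshape_as(w)

    def forward(self, x):
        w = self.weight
        # STE: forward sees quantized weights; backward passes gradients
        # straight through to the full-precision master weights.
        w_q = w + (self.quantize(w) - w).detach()
        return F.linear(x, w_q, self.bias)
```

Because quantization runs in every forward pass, the trained network is exactly the network that gets stored, which is the source of the near-zero quantization gap claimed below.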
Architecture
BitLinear
All linear layers use ternary weight quantization with per-group scaling and STE gradients.
parameters: {"group_size":64}
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":12,"kv_heads":6}
MLP3x
Expanded MLP with 3x hidden dimension.
parameters: {"mlp_multiplier":3,"hidden_dim":2304}
RoPE
Rotary positional embeddings with a larger base.
parameters: {"base":200000}
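The architectural choices listed above can be collected into a single config sketch. The values come from the technique list; `d_model = 768` is an assumption inferred from `hidden_dim = 3 * d_model = 2304`, and the class name is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # d_model is inferred from hidden_dim = 3 * d_model = 2304 (assumption).
    d_model: int = 768
    n_heads: int = 12            # query heads
    n_kv_heads: int = 6          # grouped-query attention: 2 query heads per KV head
    mlp_hidden: int = 2304       # 3x expanded MLP
    rope_base: float = 200_000.0 # larger-than-default RoPE base
    tie_embeddings: bool = True  # input and output embeddings share weights
```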
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.04,"scalar_lr":0.04,"tied_embedding_lr":0.03,"warmup_momentum_start":0.92,"warmup_steps":1500}
LR Schedule
linear warmup + wallclock-aware linear warmdown
parameters: {"warmup_steps":50,"warmdown_steps":1200}
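A minimal sketch of this schedule, assuming the wallclock-aware warmdown can be approximated by a fixed `total_steps` (in the PR the warmdown trigger depends on remaining wallclock time; the function name is hypothetical):

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 50, warmdown_steps: int = 1200) -> float:
    """Multiplier on the base LR: linear warmup, flat middle,
    linear warmdown over the final warmdown_steps steps."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup
    if step >= total_steps - warmdown_steps:
        remaining = total_steps - step
        return max(remaining / warmdown_steps, 0.0)  # linear warmdown
    return 1.0                                    # constant plateau
```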
Sequence Length
train_length: 2048
eval_length: 2048
Compression
lzma
level: null
Other
other
fp16 scale simulation during training using .half().float() to match stored scale precision and reduce the quantization gap.
parameters: null
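The fp16 scale simulation amounts to round-tripping the per-group scale through half precision during training, so the forward pass sees exactly the scale value that will later be stored in the artifact. A minimal sketch (the function name and group layout are assumptions):

```python
import torch

def fp16_scale(w: torch.Tensor, group_size: int = 64) -> torch.Tensor:
    """Per-group absmean scale, round-tripped through fp16 so training
    matches the precision of the scales stored in the artifact."""
    g = w.reshape(-1, group_size)
    scale = g.abs().mean(dim=1, keepdim=True)
    return scale.half().float()  # simulate fp16 storage precision
```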
other
Base-3 packing of ternary weights with 5 trits per byte for lossless artifact storage.
parameters: {"trits_per_byte":5}
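Base-3 packing works because 3^5 = 243 fits in one byte, so 5 ternary values cost 8 bits (1.6 bits/weight) with no loss. A sketch of the encode/decode pair, assuming trits arrive as a flat sequence of {-1, 0, 1} (function names are hypothetical):

```python
def pack_trits(trits) -> bytes:
    """Pack ternary values {-1, 0, 1} into bytes, 5 trits per byte."""
    trits = list(trits)
    while len(trits) % 5:
        trits.append(0)                  # zero-pad to a multiple of 5
    out = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in trits[i:i + 5]:
            b = b * 3 + (t + 1)          # map {-1, 0, 1} -> base-3 digit {0, 1, 2}
        out.append(b)                    # 3**5 = 243 <= 256, fits one byte
    return bytes(out)

def unpack_trits(data: bytes, n: int):
    """Recover the first n trits from packed bytes (lossless)."""
    trits = []
    for b in data:
        group = []
        for _ in range(5):
            group.append(b % 3 - 1)      # peel base-3 digits, least first
            b //= 3
        trits.extend(reversed(group))
    return trits[:n]
```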
Novel Contributions
- Uses BitNet b1.58 ternary weights to fit 64.5M parameters into a 15.1MB artifact.
- Achieves near-zero quantization gap by training with ternary quantization active in every forward pass.
- Uses fp16 scale simulation (.half().float()) so training matches stored scale precision.
- Applies base-3 packing (5 trits per byte) for lossless, compact artifact storage.
- Demonstrates that a 10-minute ternary model can beat a 4-hour full-precision baseline under the same size budget.
- Argues that Chinchilla scaling under a fixed artifact-size constraint favors more low-precision parameters over fewer high-precision parameters.
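The size argument can be sanity-checked with back-of-envelope arithmetic. This assumes an artifact layout of packed ternary weights plus one fp16 scale per 64-weight group; the PR does not spell out the exact layout, so treat the numbers as illustrative.

```python
# Assumed layout: ternary weights at 5 trits/byte + fp16 scale per 64-weight group.
params = 64.5e6
weight_bytes = params / 5            # 5 trits per byte
scale_bytes = params / 64 * 2        # one 2-byte fp16 scale per group of 64
total_mb = (weight_bytes + scale_bytes) / 1e6
print(total_mb)                      # ~14.9 MB, close to the 15.11 MB artifact

# The same 15.11 MB budget spent on fp16 weights instead:
fp16_params = 15.11e6 / 2            # ~7.6M parameters, ~8x fewer than ternary
```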