PR #367
Non-record: BitNet b1.58 - 68M ternary params, val_bpb=1.1770, systematic analysis of ternary limitations
by ksang123
val_bpb
1.1770
Architecture
Transformer
Optimizer
—
Artifact Size
15.88MB
Training Techniques
Quantization
ternary QAT
bits: 2
scope: all projections
Architecture
BitLinear
Ternary {-1, 0, 1} linear layers used for all attention and MLP projections with per-group absmax STE.
parameters: null
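The BitLinear entry above can be sketched as follows. This is a minimal illustration, not the PR's code: `group_size=64` and the exact rounding rule are assumptions, and the training-time straight-through estimator (STE) is only described in the comments.

```python
def ternary_quantize(weights, group_size=64):
    # Per-group absmax ternary quantization: each group is scaled by its
    # max |w|, then rounded to {-1, 0, 1}. group_size=64 is an assumed
    # setting. During training an STE is applied on top of this: the forward
    # pass uses the quantized weights, while the backward pass propagates
    # gradients as if quantization were the identity.
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) or 1.0  # absmax scale per group
        codes.extend(max(-1, min(1, round(w / scale))) for w in group)
        scales.append(scale)
    return codes, scales

def ternary_dequantize(codes, scales, group_size=64):
    # Reconstruct approximate fp weights: code * its group's scale.
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```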
MLP 3.25x
Widened MLP from 3x to 3.25x to add parameters at low artifact cost.
parameters: {"hidden":2496}
GQA
Grouped-query attention with 6 KV heads.
parameters: {"heads":12,"kv_heads":6}
U-Net skip connections
Added U-Net-style skip connections across the transformer layer stack.
parameters: null
tied embeddings
Tied the input embedding and output head weights, stored in fp16.
parameters: null
RoPE
Rotary positional embeddings with a large base.
parameters: {"base":200000}
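The RoPE entry's large base can be illustrated with the standard inverse-frequency schedule; `head_dim=64` in the usage below is an assumption, while `base=200000` is the PR's setting.

```python
def rope_inv_freq(head_dim, base=200000.0):
    # Inverse rotation frequency for each even/odd dimension pair in rotary
    # embeddings. A larger base slows the rotation of the high-index pairs,
    # stretching positional resolution over longer distances.
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]
```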
logit softcap
Applied a tanh softcap to the output logits.
parameters: {"value":30}
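The logit softcap with value 30 is the usual tanh bound; a minimal sketch:

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds a logit to (-cap, cap) via cap * tanh(logit / cap).
    # Near zero this is approximately the identity; cap=30 matches the PR.
    return cap * math.tanh(logit / cap)
```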
LR Schedule
warmdown
parameters: {"longer_warmdown":true}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Weight Averaging
EMA/SWA
parameters: null
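The EMA side of the weight-averaging entry can be sketched as a one-line update; `decay=0.999` is an assumed value, since the PR does not list its EMA/SWA settings.

```python
def ema_update(avg_params, params, decay=0.999):
    # Exponential moving average of model weights, applied once per step:
    # avg <- decay * avg + (1 - decay) * current. decay=0.999 is assumed.
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, params)]
```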
Initialization
OrthoInit
Orthogonal initialization used in some ablations; found to have no effect for ternary models.
Test-Time Training
TTT
parameters: {"learning_rate":0.002}
Evaluation
sliding window eval
parameters: {"stride":64}
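The sliding-window evaluation with stride 64 can be sketched as a window planner: each window scores only the tokens not yet covered, so every token is evaluated exactly once with as much left context as fits. `stride=64` matches the PR; `context=512` is an assumed context length.

```python
def sliding_eval_windows(n_tokens, context=512, stride=64):
    # Returns (start, stop, n_scored) triples. The first window scores its
    # full span; each later window advances by `stride` and scores only the
    # newly covered tokens, reusing the rest as context.
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```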
Compression
lzma
level: null
Other
other
Base-3 packing of ternary weights at 1.6 bits/parameter.
parameters: {"bits_per_param":1.6}
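The 1.6 bits/parameter figure comes from packing five trits per byte (3^5 = 243 fits in 256, so 8/5 = 1.6 bits each); at 68M parameters that is roughly 68e6/5 ≈ 13.6 MB of packed weights, leaving room in the 15.88 MB artifact for fp16 embeddings and scales. A minimal sketch of the packing, not the PR's exact serialization code:

```python
def pack_ternary(codes):
    # Map {-1, 0, 1} -> {0, 1, 2} and pack five trits per byte as a base-3
    # number: max value 2 * (1 + 3 + 9 + 27 + 81) = 242 < 256.
    trits = [c + 1 for c in codes]
    trits += [0] * (-len(trits) % 5)  # pad to a multiple of 5
    packed = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in reversed(trits[i:i + 5]):
            b = b * 3 + t
        packed.append(b)
    return bytes(packed)

def unpack_ternary(packed, n):
    # Invert the packing: peel off base-3 digits, map {0, 1, 2} -> {-1, 0, 1}.
    out = []
    for b in packed:
        for _ in range(5):
            out.append(b % 3 - 1)
            b //= 3
    return out[:n]
```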
other
FP16 scale simulation during training to match serialization precision and reduce the roundtrip gap.
parameters: {"roundtrip_gap":0.0016}
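The fp16 scale simulation amounts to rounding each quantization scale through half precision during training, so train-time dequantization matches what the serialized artifact will produce. A stdlib-only sketch of that rounding, using `struct`'s half-float format:

```python
import struct

def fp16_round(x):
    # Round a Python float through IEEE 754 half precision ('e' format),
    # mimicking how a scale value is serialized. Training against this
    # rounded scale moves the fp16 error into training instead of leaving
    # it as a quantization roundtrip gap at export.
    return struct.unpack('e', struct.pack('e', x))[0]
```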
Novel Contributions
- Systematic negative-results analysis of techniques that break or do not help ternary models
- Near-lossless ternary quantization roundtrip via fp16 scale simulation during training
- Demonstrated that ternary models prefer a higher learning rate, no regularization, and a longer warmdown
- Showed that base-3 packing can store 68M ternary parameters in 15.88MB
- Suggested int4 with late QAT as an unexplored middle ground