PR #666

open

Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB)

by chrislovescoding
val_bpb
1.1932
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,878,267 bytes

Training Techniques

Quantization
ternary
bits: 2
scope: all weights
Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"layers":12,"dimensions":768,"heads":12,"kv_heads":6}
MLP3x
Expanded MLP with 3x hidden size.
parameters: {"hidden":2304}
U-Net skip connections
Adds skip connections in a U-Net-like pattern to the transformer.
parameters: null
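For the KV head count entry above, a minimal shape sketch of grouped-query attention in NumPy (illustrative only; variable names are not from the PR's code):

```python
import numpy as np

# Config from the PR: dim 768, 12 attention heads, 6 KV heads.
dim, n_heads, n_kv_heads, seq = 768, 12, 6, 10
head_dim = dim // n_heads                      # 64

q = np.zeros((1, n_heads, seq, head_dim))      # one query tensor per head
kv = np.zeros((1, n_kv_heads, seq, head_dim))  # half as many K/V tensors

# Each KV head serves n_heads // n_kv_heads = 2 query heads: repeat it
# along the head axis before the usual attention matmuls.
k = np.repeat(kv, n_heads // n_kv_heads, axis=1)
assert k.shape == q.shape                      # (1, 12, 10, 64)
```

Halving the KV heads halves the K/V projection parameters and the KV cache, which matters under a tight artifact budget.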
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02}
Compression
zlib
level: 9
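Ternary codes compress well under deflate because most entries are zero. A hedged sketch of one possible pack-then-compress path (the {-1, 0, +1} → {0, 1, 2} mapping and the 4-codes-per-byte layout are assumptions for illustration, not the PR's actual serialization format):

```python
import zlib
import numpy as np

def pack_and_compress(codes: np.ndarray) -> bytes:
    """Pack ternary codes {-1, 0, +1} into 2-bit fields, then zlib level 9."""
    vals = (codes.ravel() + 1).astype(np.uint8)      # {-1,0,1} -> {0,1,2}
    pad = (-len(vals)) % 4                           # pad to a byte boundary
    vals = np.concatenate([vals, np.zeros(pad, np.uint8)])
    v = vals.reshape(-1, 4)                          # 4 two-bit codes per byte
    packed = v[:, 0] | (v[:, 1] << 2) | (v[:, 2] << 4) | (v[:, 3] << 6)
    return zlib.compress(packed.astype(np.uint8).tobytes(), level=9)
```

Level 9 trades compression time for the smallest output, which is the right trade for a one-shot artifact.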
Evaluation
sliding window eval
parameters: {"stride":64}
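Sliding-window evaluation with stride 64 scores each token with long left context, keeping only the freshly advanced positions from each window. A generic sketch (the `nll_fn` interface is a stand-in, not the PR's API):

```python
import math

def sliding_window_eval(nll_fn, tokens, window=1024, stride=64):
    """Return one negative log-likelihood (in nats) per token.

    Windows advance by `stride`; only the last `end - pos` scores of each
    window are new, so every kept token sees up to `window` of context.
    """
    nlls = []
    for pos in range(0, len(tokens), stride):
        end = min(pos + stride, len(tokens))
        start = max(0, end - window)
        out = nll_fn(tokens[start:end])    # per-token NLLs for this window
        nlls.extend(out[-(end - pos):])    # keep only newly covered tokens
    return nlls

def bits_per_byte(nlls, n_bytes):
    """Convert summed NLLs (nats) over a text of n_bytes into BPB."""
    return sum(nlls) / (math.log(2) * n_bytes)
```

A smaller stride gives each scored token more context at the cost of more forward passes per token.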
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Ternary activation schedule: full-precision training for the first 30% of wallclock, then ternary STE for the remaining 70%.
parameters: {"ternary_start_frac":0.3}
other
Straight-Through Estimator ternary training with per-row mean-absolute scaling and thresholding to {-1, 0, +1}.
parameters: {"threshold_multiplier":0.7}
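The two items above combine as a per-row absmean quantizer inside a straight-through training loop. A minimal NumPy sketch (the 0.7 threshold and per-row mean-absolute scaling are from the parameters listed; the function name is illustrative):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, threshold_multiplier: float = 0.7):
    """Per-row absmean ternary quantization to {-1, 0, +1}.

    scale  = mean(|row|), one scale per output row;
    entries below threshold_multiplier * scale snap to 0.
    """
    scale = np.mean(np.abs(w), axis=1, keepdims=True)
    codes = np.sign(w) * (np.abs(w) >= threshold_multiplier * scale)
    return codes.astype(np.int8), codes * scale   # codes, dequantized weights

# Straight-through estimator in autograd frameworks (PyTorch-style, shown
# as a comment): forward uses the dequantized weights, backward treats
# quantization as identity:
#   w_ste = w + (dequant - w).detach()
```

Because the forward pass already sees the ternary weights during training, the exported artifact matches what the loss was optimized against.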

Novel Contributions

  • Trains a 65M-parameter model within a 15.9MB artifact budget using ternary weight quantization.
  • Achieves a near-zero quantization gap by keeping ternary STE in the training loop rather than quantizing post hoc.
  • Demonstrates that a much larger model fits in the same artifact budget as smaller int6 submissions.
  • Combines ternary quantization with grouped-query attention, tied embeddings, and U-Net skip connections.
  • Applies a staged training schedule that switches from full precision to ternary STE after 30% of wallclock.
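As a back-of-envelope check on the artifact budget (assuming 2-bit packing before compression; 65M is the rounded parameter count from the title):

```python
params = 65_000_000                    # ~65M weights, all ternary
raw_bytes = params * 2 // 8            # 2 bits/weight -> 16,250,000 bytes
artifact_bytes = 15_878_267            # reported artifact size
assert artifact_bytes < raw_bytes      # zlib level 9 closes the remaining gap
```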