PR #1811

open

Non-record: BitNet 65M params — val_bpb 1.235

by peytontolbert
val_bpb: 1.2350
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,330,708 bytes

Training Techniques

Quantization
STE QAT
bits: null
scope: weights
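The STE QAT entry above can be sketched as straight-through-estimator ternary quantization: the forward pass rounds weights to {-1, 0, +1} times a scale, while the backward pass passes gradients through the non-differentiable rounding step unchanged. A minimal sketch; the function names and the absmean scale rule (the common BitNet b1.58 convention) are assumptions, not the PR's exact code.

```python
def absmean_scale(weights):
    """Per-tensor scale: mean absolute value (BitNet b1.58 convention)."""
    return sum(abs(w) for w in weights) / len(weights) or 1.0

def quantize_ternary(weights):
    """Forward pass: map each weight to {-1, 0, +1} * scale."""
    s = absmean_scale(weights)
    return [s * max(-1, min(1, round(w / s))) for w in weights]

def ste_grad(upstream_grads):
    """Backward pass (STE): gradients flow through the rounding
    step as if it were the identity."""
    return list(upstream_grads)
```

During QAT the full-precision master weights are updated with these straight-through gradients, so training sees the ternary forward behavior while optimization stays smooth.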
Architecture
weight tying
Tied input and output embeddings.
parameters: null
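Weight tying, as noted above, reuses the input embedding matrix as the output (unembedding) projection, so the logit layer adds no parameters of its own. A minimal sketch with illustrative sizes:

```python
def logits_from_hidden(hidden, embedding):
    """Unembed by dotting the hidden state with each row of the
    same matrix used to embed input tokens (tied weights)."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
```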
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"heads":16,"kv_heads":4}
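With the head counts listed above (16 query heads, 4 KV heads), each KV head serves a contiguous group of 16 / 4 = 4 query heads. A minimal sketch of that query-to-KV mapping; the contiguous-group assignment is the usual GQA convention and an assumption here:

```python
def kv_head_for(query_head: int, n_heads: int = 16, n_kv_heads: int = 4) -> int:
    """Each KV head is shared by a contiguous group of query heads."""
    group_size = n_heads // n_kv_heads  # 4 query heads per KV head
    return query_head // group_size
```

Shrinking the KV heads this way cuts the KV-cache and KV-projection parameters to a quarter of the multi-head baseline while keeping all 16 query heads.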
RoPE
Uses YaRN/RoPE context extension.
parameters: {"context_length":4096}
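The RoPE entry above can be sketched as rotating consecutive channel pairs by a position-dependent angle. The base of 10000 is the common default and an assumption; the PR only states a 4096-token context, and the YaRN extension's frequency rescaling is not reproduced here.

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive (even, odd) channel pairs of x by a
    position-dependent angle; len(x) must be even."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Because each pair is rotated (not scaled), the vector norm is preserved and relative positions fall out of the dot product between rotated queries and keys.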
depth recurrence
Training and evaluation use depth recurrence.
parameters: {"training_depth_recurrence":1,"evaluation_depth_recurrence":1}
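Depth recurrence re-applies the same block stack several times per forward pass; with a training and evaluation depth of 1, as configured above, it reduces to a standard single pass. A minimal sketch (function and argument names are illustrative):

```python
def run_with_recurrence(blocks, x, depth=1):
    """Apply the whole block stack `depth` times; depth=1 is a
    plain single-pass transformer forward."""
    for _ in range(depth):
        for block in blocks:
            x = block(x)
    return x
```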
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_parameters":true,"scalar_embedding_parameters":"Adam"}
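The `other_params` entry above implies a split optimizer: Muon for 2-D matrix parameters and Adam for scalar and embedding parameters. A sketch of that routing rule; the name-based embedding check and `ndim` argument are assumptions, not the PR's exact logic:

```python
def choose_optimizer(name: str, ndim: int) -> str:
    """Route 2-D matrix parameters to Muon; scalars and
    embeddings (matched by name here) to Adam."""
    if ndim == 2 and "embed" not in name:
        return "Muon"
    return "Adam"
```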
Sequence Length
train_length: 4096
eval_length: 4096
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
warmup
parameters: {"warmup_steps":1}
Other
Runtime-row ternary scaling aligned to Model Stack's packed BitNet runtime export format.
parameters: {"scale_layout":"runtime_row","group_size":64}
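The runtime-row scaling above can be sketched as splitting each weight-matrix row into groups of 64, each storing one shared scale plus a ternary digit per weight. This is a hypothetical sketch of the scheme under those parameters; the exact Model Stack packed export layout is not reproduced here.

```python
GROUP_SIZE = 64  # group_size from the PR's parameters

def scale_row(row):
    """Return (scales, digits): one absmean scale per 64-weight
    group and a ternary digit in {-1, 0, +1} for every weight."""
    scales, digits = [], []
    for start in range(0, len(row), GROUP_SIZE):
        group = row[start:start + GROUP_SIZE]
        s = sum(abs(w) for w in group) / len(group) or 1.0
        scales.append(s)
        digits += [max(-1, min(1, round(w / s))) for w in group]
    return scales, digits

def reconstruct(scales, digits):
    """Invert scale_row: recover the quantized values exactly,
    which is what a zero reconstruction error check verifies."""
    return [scales[i // GROUP_SIZE] * d for i, d in enumerate(digits)]
```

Storing one scale per 64-weight group along each row is what lets the exported artifact stay small: the bulk of the payload is 2-bit-packable ternary digits rather than full-precision weights.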

Novel Contributions

  • Non-record 65M-parameter BitNet-style ternary transformer submission
  • Runtime-row ternary scaling matched to Model Stack packed BitNet inference layout
  • Exact packed BitNet runtime export with zero skipped tensors and zero packed-weight reconstruction error
  • Near-zero val_bpb gap between the pre-roundtrip checkpoint and the final exported artifact
  • Training a larger ternary model within the 16MB Parameter Golf artifact budget