PR #1811

open

Non-record: BitNet 65M params — val_bpb 1.235

by peytontolbert
val_bpb: 1.2350
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,330,708 bytes

Training Techniques

Quantization
STE QAT
bits: null
scope: weights
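The STE QAT entry above can be sketched as straight-through-estimator ternary quantization: the forward pass rounds weights to {-1, 0, +1} times a scale, while the backward pass passes gradients through the non-differentiable rounding step unchanged. A minimal sketch; the function names and the absmean scale rule (the common BitNet b1.58 convention) are assumptions, not the PR's exact code.

```python
def absmean_scale(weights):
    """Per-tensor scale: mean absolute value (BitNet b1.58 convention)."""
    return sum(abs(w) for w in weights) / len(weights) or 1.0

def quantize_ternary(weights):
    """Forward pass: map each weight to {-1, 0, +1} * scale."""
    s = absmean_scale(weights)
    return [s * max(-1, min(1, round(w / s))) for w in weights]

def ste_grad(upstream_grads):
    """Backward pass (STE): gradients flow through the rounding
    step as if it were the identity."""
    return list(upstream_grads)
```

During QAT the full-precision master weights are updated with these straight-through gradients, so training sees the ternary forward behavior while optimization stays smooth.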
Architecture
weight tying
Tied input and output embeddings.
parameters: null
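Weight tying, as noted above, reuses the input embedding matrix as the output (unembedding) projection, so the logit layer adds no parameters of its own. A minimal sketch with illustrative sizes:

```python
def logits_from_hidden(hidden, embedding):
    """Unembed by dotting the hidden state with each row of the
    same matrix used to embed input tokens (tied weights)."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
```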
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"heads":16,"kv_heads":4}
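With the head counts listed above (16 query heads, 4 KV heads), each KV head serves a contiguous group of 16 / 4 = 4 query heads. A minimal sketch of that query-to-KV mapping; the contiguous-group assignment is the usual GQA convention and an assumption here:

```python
def kv_head_for(query_head: int, n_heads: int = 16, n_kv_heads: int = 4) -> int:
    """Each KV head is shared by a contiguous group of query heads."""
    group_size = n_heads // n_kv_heads  # 4 query heads per KV head
    return query_head // group_size
```

Shrinking the KV heads this way cuts the KV-cache and KV-projection parameters to a quarter of the multi-head baseline while keeping all 16 query heads.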
RoPE
Uses YaRN/RoPE context extension.
parameters: {"context_length":4096}
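The RoPE entry above can be sketched as rotating consecutive channel pairs by a position-dependent angle. The base of 10000 is the common default and an assumption; the PR only states a 4096-token context, and the YaRN extension's frequency rescaling is not reproduced here.

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Rotate consecutive (even, odd) channel pairs of x by a
    position-dependent angle; len(x) must be even."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Because each pair is rotated (not scaled), the vector norm is preserved and relative positions fall out of the dot product between rotated queries and keys.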
depth recurrence
Training and evaluation use depth recurrence.
parameters: {"training_depth_recurrence":1,"evaluation_depth_recurrence":1}
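Depth recurrence re-applies the same block stack several times per forward pass; with a training and evaluation depth of 1, as configured above, it reduces to a standard single pass. A minimal sketch (function and argument names are illustrative):

```python
def run_with_recurrence(blocks, x, depth=1):
    """Apply the whole block stack `depth` times; depth=1 is a
    plain single-pass transformer forward."""
    for _ in range(depth):
        for block in blocks:
            x = block(x)
    return x
```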
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_parameters":true,"scalar_embedding_parameters":"Adam"}
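The `other_params` entry above implies a split optimizer: Muon for 2-D matrix parameters and Adam for scalar and embedding parameters. A sketch of that routing rule; the name-based embedding check and `ndim` argument are assumptions, not the PR's exact logic:

```python
def choose_optimizer(name: str, ndim: int) -> str:
    """Route 2-D matrix parameters to Muon; scalars and
    embeddings (matched by name here) to Adam."""
    if ndim == 2 and "embed" not in name:
        return "Muon"
    return "Adam"
```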
Sequence Length
train_length: 4096
eval_length: 4096
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
warmup
parameters: {"warmup_steps":1}
Other
Runtime-row ternary scaling aligned to Model Stack's packed BitNet runtime export format.
parameters: {"scale_layout":"runtime_row","group_size":64}
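The runtime-row scaling above can be sketched as splitting each weight-matrix row into groups of 64, each storing one shared scale plus a ternary digit per weight. This is a hypothetical sketch of the scheme under those parameters; the exact Model Stack packed export layout is not reproduced here.

```python
GROUP_SIZE = 64  # group_size from the PR's parameters

def scale_row(row):
    """Return (scales, digits): one absmean scale per 64-weight
    group and a ternary digit in {-1, 0, +1} for every weight."""
    scales, digits = [], []
    for start in range(0, len(row), GROUP_SIZE):
        group = row[start:start + GROUP_SIZE]
        s = sum(abs(w) for w in group) / len(group) or 1.0
        scales.append(s)
        digits += [max(-1, min(1, round(w / s))) for w in group]
    return scales, digits

def reconstruct(scales, digits):
    """Invert scale_row: recover the quantized values exactly,
    which is what a zero reconstruction error check verifies."""
    return [scales[i // GROUP_SIZE] * d for i, d in enumerate(digits)]
```

Storing one scale per 64-weight group along each row is what lets the exported artifact stay small: the bulk of the payload is 2-bit-packable ternary digits rather than full-precision weights.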

Novel Contributions

  • Non-record 65M-parameter BitNet-style ternary transformer submission
  • Runtime-row ternary scaling matched to Model Stack packed BitNet inference layout
  • Exact packed BitNet runtime export with zero skipped tensors and zero packed-weight reconstruction error
  • Near-zero val_bpb gap between the pre-roundtrip checkpoint and the final exported artifact
  • Training a larger ternary model within the 16MB Parameter Golf artifact budget