PR #126

open

Non-record: BitNet b1.58 + depth recurrence + NorMuon (1.7510 BPB, 3.78 MB)

by Athenox14
val_bpb: 1.7510
Architecture: Transformer
Optimizer: Muon
Artifact Size: 3.78 MB

Training Techniques

Quantization
QAT
bits: 2
scope: all weights
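The b1.58 recipe quantizes every weight to {-1, 0, +1} with an absmean scale and then packs four ternary values per byte (2 bits each). A minimal sketch; the function names `absmean_quantize` and `pack_ternary` and the -1/0/+1 → 0/1/2 bit mapping are illustrative, not taken from the submission:

```python
def absmean_quantize(weights):
    """Quantize floats to {-1, 0, +1} using the BitNet b1.58 absmean rule:
    w_q = clip(round(w / mean(|w|)), -1, 1)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    return [max(-1, min(1, round(w / scale))) for w in weights], scale

def pack_ternary(tern):
    """Pack ternary values 2 bits each (4 per byte), mapping -1/0/+1 -> 0/1/2."""
    out = bytearray()
    for i in range(0, len(tern), 4):
        byte = 0
        for j, t in enumerate(tern[i:i + 4]):
            byte |= (t + 1) << (2 * j)
        out.append(byte)
    return bytes(out)

q, s = absmean_quantize([0.8, -0.05, -1.2, 0.3])  # -> [1, 0, -1, 1]
packed = pack_ternary(q)                          # 4 values in 1 byte
```

During QAT the forward pass uses the quantized weights while gradients flow to the latent full-precision copies (straight-through estimator); only the packed artifact is shipped.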
Architecture
depth recurrence
4 unique transformer blocks are reused 3 times each for 12 effective layers, with U-Net style skip connections between encoder and decoder halves.
parameters: {"unique_layers":4,"recurrence_factor":3,"effective_layers":12}
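The layer schedule implied by these parameters can be sketched as follows. The mirror pairing of encoder and decoder passes follows the U-Net description above, but the additive-skip form is an assumption, since the PR does not spell out the skip arithmetic:

```python
# 4 unique blocks, each reused 3 times -> 12 effective layers.
UNIQUE_LAYERS, RECURRENCE = 4, 3
schedule = [k % UNIQUE_LAYERS for k in range(UNIQUE_LAYERS * RECURRENCE)]
half = len(schedule) // 2
# U-Net pairing: decoder pass d receives a skip from the mirrored
# encoder pass (2 * half - 1 - d).
skips = {d: 2 * half - 1 - d for d in range(half, 2 * half)}

def forward(x, blocks):
    """Run the recurrent schedule with additive skips (illustrative)."""
    cache = {}
    for k, b in enumerate(schedule):
        x = blocks[b](x)
        if k < half:
            cache[k] = x                 # encoder half: stash activations
        else:
            x = x + cache[skips[k]]      # decoder half: add mirrored skip
    return x
```

Weight reuse means the artifact stores only the 4 unique blocks, which is what keeps the compressed size small despite 12 effective layers.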
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
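With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. The head-to-KV-head mapping:

```python
# Grouped-query attention: each KV head is shared by
# HEADS // KV_HEADS consecutive query heads.
HEADS, KV_HEADS = 8, 4
GROUP = HEADS // KV_HEADS                      # 2 query heads per KV head
kv_for_head = [h // GROUP for h in range(HEADS)]  # [0,0,1,1,2,2,3,3]
```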
QK-norm
Applies RMSNorm to Q and K before RoPE.
parameters: null
logit softcapping
Uses tanh-based softcapping on logits.
parameters: {"cap":30}
RoPE
Uses NTK-aware RoPE base scaling with YaRN-style sequence length warmup.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_wise_rms_normalization":true,"newton_schulz_orthogonalization":true}
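A sketch of the NorMuon update, assuming the standard quintic Newton-Schulz coefficients from the public Muon reference implementation; the PR only names the two components (orthogonalization, then per-neuron row-wise RMS normalization):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration approximating the orthogonal
    factor of G (coefficients as in the Muon reference; assumed here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def normuon_update(G):
    """Orthogonalize the (momentum-averaged) gradient, then RMS-normalize
    each row so every neuron receives an update of equal RMS magnitude."""
    O = newton_schulz(G)
    rms = np.sqrt((O * O).mean(axis=1, keepdims=True)) + 1e-7
    return O / rms
```

The row-wise normalization is the "Nor" part: after orthogonalization, per-neuron update magnitudes are equalized before the learning rate is applied.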
Compression
zlib
level: 9
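The 2-bit-packed ternary buffer is low-entropy, so zlib at level 9 shrinks it further for the reported artifact size. A sketch with a synthetic buffer standing in for real packed weights:

```python
import zlib

# Illustrative packed buffer: 4096 bytes of 2-bit-packed ternary values.
packed = bytes([0b10_01_00_10] * 4096)
compressed = zlib.compress(packed, level=9)  # level 9, as in the PR
restored = zlib.decompress(compressed)       # lossless round-trip
```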
Evaluation
sliding window eval
parameters: {"stride":"seq_len // 2","skip_cold_start_tokens":true}
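With stride `seq_len // 2`, each window re-reads the previous half as warm context and scores only the fresh tokens, so every token is evaluated with at least half a window of context. A sketch of the window schedule; exact boundary handling is an assumption:

```python
def sliding_windows(n_tokens, seq_len=1024):
    """Yield (start, end, first_scored) windows. Overlapping halves are
    cold-start context only; each token is scored exactly once."""
    stride = seq_len // 2
    start = 0
    while start < n_tokens:
        end = min(start + seq_len, n_tokens)
        first_scored = start if start == 0 else start + stride
        yield start, end, first_scored
        if end == n_tokens:
            break
        start += stride
```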
Initialization
proj zero-init
Output projections of attention and MLP are zero-initialized so each block starts as the identity.
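With the output projection zero-initialized, the residual branch contributes `0 * f(x)` at step 0, so the block is exactly the identity; training then grows the projection away from zero. A toy sketch with a scalar stand-in for the branch:

```python
def block(x, w_out=0.0):
    """Residual block with zero-init output projection (illustrative)."""
    hidden = 2.0 * x + 1.0       # stand-in for the attention/MLP branch
    return x + w_out * hidden    # w_out = 0 at init -> identity
```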
resid_mix
Learnable per-block mixing of current hidden state with original embedding, initialized to [1, 0].
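The [1, 0] initialization makes the mix a pass-through of the hidden state, while training can learn to re-inject the original embedding. A minimal sketch (the scalar form is illustrative; per-block parameters would be learned tensors):

```python
def resid_mix(hidden, embedding, alpha=1.0, beta=0.0):
    """Learnable per-block blend of the current hidden state with the
    original token embedding; [alpha, beta] initialized to [1, 0]."""
    return alpha * hidden + beta * embedding
```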
LR Schedule
linear warmup + constant + cosine cooldown
parameters: {"warmup_steps":100,"cooldown_steps":2000,"final_lr_multiplier":0.1}
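The three-phase multiplier implied by these parameters can be sketched as below; the total step count is illustrative, and the cooldown decays the peak LR to the 0.1x floor:

```python
import math

def lr_multiplier(step, total_steps, warmup=100, cooldown=2000, final=0.1):
    """Linear warmup -> constant -> cosine cooldown to final * peak LR."""
    if step < warmup:
        return step / warmup                       # linear warmup
    if step < total_steps - cooldown:
        return 1.0                                 # constant plateau
    t = (step - (total_steps - cooldown)) / cooldown  # 0 -> 1 over cooldown
    return final + (1.0 - final) * 0.5 * (1.0 + math.cos(math.pi * t))
```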
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Other
other
Sequence length warmup from 128 to 1024 over 2000 steps with NTK-aware RoPE base scaling (YaRN-style).
parameters: {"start_length":128,"end_length":1024,"warmup_steps":2000}
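A sketch of the two coupled schedules. The warmup is a straightforward linear ramp; for the RoPE side I use the standard NTK-aware rule base' = base * s^(d/(d-2)), where the reference length, head dimension, and applying the rule during warmup (s < 1) are assumptions not stated in the PR:

```python
def seq_len_at(step, start=128, end=1024, warmup_steps=2000):
    """Linear sequence-length warmup: 128 -> 1024 over 2000 steps."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step // warmup_steps

def ntk_rope_base(seq_len, base=10000.0, ref_len=1024, head_dim=64):
    """NTK-aware RoPE base scaling with s = seq_len / ref_len.
    ref_len and head_dim are illustrative assumptions."""
    s = seq_len / ref_len
    return base * s ** (head_dim / (head_dim - 2))
```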

Novel Contributions

  • BitNet b1.58 ternary quantization with packed 2-bit weights and zlib compression
  • Depth recurrence with 4 unique transformer blocks reused 3 times for 12 effective layers
  • U-Net style skip connections across recurrent block passes
  • Learnable resid_mix parameter to blend recurrent hidden state with original embedding
  • NorMuon optimizer with per-neuron row-wise RMS normalization after Newton-Schulz orthogonalization
  • Sequence length warmup combined with YaRN / NTK-aware RoPE scaling
  • Sliding-window evaluation with cold-start token skipping
  • QK-norm and logit softcapping