PR #1891

open

Add Model Stack BitNet MLP2304 overtone run

by peytontolbert
val_bpb
1.2205
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,693,288 bytes

Training Techniques

Quantization
QAT
bits: null
scope: BitNet-compatible model
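The BitNet-style QAT above can be sketched as ternary weight quantization. This is a minimal illustration, not the PR's `TrainableBitNetLinear` implementation: it assumes the common BitNet b1.58 recipe of scaling by the mean absolute weight and rounding to {-1, 0, +1} (during training, the rounding would be bypassed in the backward pass with a straight-through estimator).

```python
import numpy as np

def bitnet_ternary_quantize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Hypothetical BitNet b1.58-style ternary quantization: scale by the
    mean absolute weight, then round each entry into {-1, 0, +1}."""
    gamma = np.abs(w).mean() + 1e-8            # per-tensor scale
    w_q = np.clip(np.round(w / gamma), -1, 1)  # ternary weights
    return w_q, gamma

def dequantize(w_q: np.ndarray, gamma: float) -> np.ndarray:
    """Recover an approximate dense weight for the forward pass."""
    return w_q * gamma
```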
Architecture
weight tying
Tied embeddings were used in the Model Stack BitNet run.
parameters: null
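Weight tying means the input embedding and the output (unembedding) projection share one parameter matrix, roughly halving embedding parameters. A minimal sketch (names and sizes are illustrative, not taken from this run):

```python
import numpy as np

vocab, dim = 1000, 64
E = np.random.default_rng(0).standard_normal((vocab, dim))  # single shared matrix

def embed(token_ids: np.ndarray) -> np.ndarray:
    """Input embedding: look up rows of the shared matrix E."""
    return E[token_ids]

def logits(h: np.ndarray) -> np.ndarray:
    """Output projection: reuse E (transposed) as the unembedding."""
    return h @ E.T
```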
KV head count
Used grouped-query style attention with fewer KV heads than query heads.
parameters: {"num_heads":16,"num_kv_heads":4}
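With 16 query heads and 4 KV heads, each group of 4 query heads shares one KV head. A minimal grouped-query attention sketch (single sequence, no masking or batching; shapes are illustrative):

```python
import numpy as np

def gqa_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                  num_kv_heads: int) -> np.ndarray:
    """Grouped-query attention: q is (num_heads, seq, d); k and v are
    (num_kv_heads, seq, d). Each KV head serves a group of query heads."""
    num_heads, seq, d = q.shape
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=0)   # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

In a real implementation the KV heads are not materially repeated; the grouping is handled inside the attention kernel, which is what makes GQA save KV-cache memory.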
depth recurrence
Training and evaluation used depth recurrence.
parameters: {"training":1,"evaluation":1}
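Depth recurrence reuses one block's weights across repeated applications, trading parameters for compute. A sketch of the mechanism (the loop count is a parameter; this run lists training=1 and evaluation=1, i.e. a single pass):

```python
import numpy as np

def depth_recurrent_forward(x, block, n_loops: int):
    """Apply the same weight-shared block n_loops times in depth."""
    for _ in range(n_loops):
        x = block(x)
    return x
```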
RoPE
Used the YaRN RoPE variant for long-context handling.
parameters: {"type":"yarn"}
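For reference, plain rotary position embedding (RoPE) rotates consecutive channel pairs by position-dependent angles; YaRN builds on this by rescaling the per-frequency wavelengths for long contexts, an adjustment omitted from this sketch:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Plain RoPE sketch: x is (seq, d) with d even; each consecutive
    channel pair is rotated by an angle that grows with position and
    shrinks with frequency index."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]               # (seq, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) inverse wavelengths
    angles = pos * freqs                        # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```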
Initialization
OvertoneInit
Spectral embedding initialization with power-law spectrum S_k ~ k^-0.5.
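One plausible reading of a spectral init with S_k ~ k^-0.5 is to build the embedding matrix from random orthonormal factors with a prescribed power-law singular-value spectrum. The exact OvertoneInit recipe is not given here; this sketch assumes S_k are the singular values, up to scale:

```python
import numpy as np

def overtone_init(vocab_size: int, dim: int,
                  alpha: float = 0.5, seed: int = 0) -> np.ndarray:
    """Hypothetical spectral embedding init: E = U diag(S) V^T with a
    power-law spectrum S_k = k^-alpha and random orthonormal U, V."""
    rng = np.random.default_rng(seed)
    r = min(vocab_size, dim)
    u, _ = np.linalg.qr(rng.standard_normal((vocab_size, r)))  # orthonormal cols
    v, _ = np.linalg.qr(rng.standard_normal((dim, r)))
    s = np.arange(1, r + 1) ** -alpha        # S_k ~ k^-0.5
    return (u * s) @ v.T
```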
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"muon_backend_steps":5}
Regularization
logit softcap
parameters: {"value":30}
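Logit softcapping with value 30 is commonly the smooth bound `cap * tanh(logits / cap)`: near-identity for small logits, asymptotically limited to (-30, 30) for large ones. A minimal sketch, assuming that standard formulation:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    """Smoothly bound logits to (-cap, cap); approximately the identity
    when |logits| << cap."""
    return cap * np.tanh(logits / cap)
```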
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
linear warmup
parameters: {"warmup_steps":1}
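With `warmup_steps: 1` the linear warmup is effectively a no-op: the learning rate reaches its full value on the first step. A sketch of the schedule (the post-warmup behavior shown here, constant LR, is an assumption; the PR only specifies the warmup):

```python
def lr_at(step: int, base_lr: float, warmup_steps: int = 1) -> float:
    """Linear warmup from ~0 to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```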

Novel Contributions

  • Model Stack-compatible runtime-row packed BitNet export
  • TrainableBitNetLinear QAT modules
  • Overtone spectral embedding initialization
  • MLP hidden dimension 2304 under the 16MB budget
  • Fused QKV with FlashAttention
  • Parallel Muon optimization
  • Dense backward pass for grad-input and grad-weight during training, used when it is faster than the compiled step