val_bpb: 1.2205
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,693,288 bytes
Training Techniques
Quantization
QAT
bits: null
scope: BitNet-compatible model
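As a rough illustration of what BitNet-style QAT can look like in PyTorch, here is a minimal sketch of a ternary-weight linear layer with a straight-through estimator. The class name and details are hypothetical, not the run's actual TrainableBitNetLinear implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableBitNetLinearSketch(nn.Module):
    """Hypothetical QAT layer: ternary (-1, 0, +1) weights in the forward
    pass, full-precision latent weights updated by the optimizer."""

    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)          # per-tensor scale
        w_q = (w / scale).round().clamp(-1, 1) * scale  # ternarize, rescale
        w_ste = w + (w_q - w).detach()                  # straight-through estimator
        return F.linear(x, w_ste, self.bias)
```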
Architecture
weight tying
Tied input/output embeddings were used in the Model Stack BitNet run.
parameters: null
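Weight tying typically amounts to sharing the token embedding matrix with the output projection; a minimal PyTorch-style sketch (module names and dimensions are illustrative):

```python
import torch.nn as nn

class TiedDecoderSketch(nn.Module):
    def __init__(self, vocab_size=50304, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # The output projection and the input embedding now share one
        # (vocab_size, d_model) parameter tensor.
        self.lm_head.weight = self.embed.weight
```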
KV head count
Used grouped-query attention (GQA) with fewer KV heads than query heads.
parameters: {"num_heads":16,"num_kv_heads":4}
depth recurrence
Training and evaluation used depth recurrence.
parameters: {"training":1,"evaluation":1}
RoPE
Used the YaRN RoPE variant for long-context handling.
parameters: {"type":"yarn"}
Initialization
OvertoneInit
Spectral embedding initialization with power-law spectrum S_k ~ k^-0.5.
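One plausible reading of this is an embedding matrix whose singular value spectrum is forced onto the stated power law; the sketch below implements that guess and should not be taken as the actual OvertoneInit procedure.

```python
import torch

def overtone_init_sketch(vocab_size, d_model, alpha=0.5):
    """Assumed spectral init: random orthogonal factors combined with a
    prescribed power-law spectrum S_k ~ k^{-alpha}."""
    g = torch.randn(vocab_size, d_model)
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    k = torch.arange(1, d_model + 1, dtype=torch.float32)
    s = k.pow(-alpha)                     # S_k ~ k^-0.5 for alpha = 0.5
    return (u * s.unsqueeze(0)) @ vh      # reassemble with the new spectrum
```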
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"muon_backend_steps":5}
Regularization
logit softcap
parameters: {"value":30}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
linear warmup
parameters: {"warmup_steps":1}
Novel Contributions
- Model Stack-compatible runtime-row packed BitNet export
- TrainableBitNetLinear QAT modules
- Overtone spectral embedding initialization
- MLP hidden dimension 2304 under the 16MB budget
- Fused QKV with FlashAttention (see the sketch after this list)
- Parallel Muon optimization
- Dense backward pass for grad-input and grad-weight during training, used when faster than the compiled step
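A sketch of the fused-QKV attention pattern referenced above, assuming PyTorch's scaled_dot_product_attention (which can dispatch to a FlashAttention kernel); names and dimensions are illustrative, and the grouped KV heads from the GQA sketch earlier are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedQKVAttentionSketch(nn.Module):
    def __init__(self, d_model=768, num_heads=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # one fused matmul
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Dispatches to the FlashAttention backend when available.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```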