val_bpb: 1.2205
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,693,288 bytes
Training Techniques
Quantization
QAT
bits: null
scope: BitNet-compatible model
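As a rough illustration of what BitNet-style QAT can look like in PyTorch, here is a minimal sketch of a ternary-weight linear layer with a straight-through estimator. The class name and details are hypothetical, not the run's actual TrainableBitNetLinear implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableBitNetLinearSketch(nn.Module):
    """Hypothetical QAT layer: ternary (-1, 0, +1) weights in the forward
    pass, full-precision latent weights updated by the optimizer."""

    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)          # per-tensor scale
        w_q = (w / scale).round().clamp(-1, 1) * scale  # ternarize, rescale
        w_ste = w + (w_q - w).detach()                  # straight-through estimator
        return F.linear(x, w_ste, self.bias)
```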
Architecture
weight tying
Tied input/output embeddings were used in the Model Stack BitNet run.
parameters: null
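Weight tying typically amounts to sharing the token embedding matrix with the output projection; a minimal PyTorch-style sketch (module names and dimensions are illustrative):

```python
import torch.nn as nn

class TiedDecoderSketch(nn.Module):
    def __init__(self, vocab_size=50304, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # The output projection and the input embedding now share one
        # (vocab_size, d_model) parameter tensor.
        self.lm_head.weight = self.embed.weight
```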
KV head count
Used grouped-query attention (GQA) with fewer KV heads than query heads.
parameters: {"num_heads":16,"num_kv_heads":4}
depth recurrence
Training and evaluation used depth recurrence.
parameters: {"training":1,"evaluation":1}
RoPE
Used the YaRN RoPE variant for long-context handling.
parameters: {"type":"yarn"}
Initialization
OvertoneInit
Spectral embedding initialization with power-law spectrum S_k ~ k^-0.5.
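One plausible reading of this is an embedding matrix whose singular value spectrum is forced onto the stated power law; the sketch below implements that guess and should not be taken as the actual OvertoneInit procedure.

```python
import torch

def overtone_init_sketch(vocab_size, d_model, alpha=0.5):
    """Assumed spectral init: random orthogonal factors combined with a
    prescribed power-law spectrum S_k ~ k^{-alpha}."""
    g = torch.randn(vocab_size, d_model)
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    k = torch.arange(1, d_model + 1, dtype=torch.float32)
    s = k.pow(-alpha)                     # S_k ~ k^-0.5 for alpha = 0.5
    return (u * s.unsqueeze(0)) @ vh      # reassemble with the new spectrum
```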
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"muon_backend_steps":5}
Regularization
logit softcap
parameters: {"value":30}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
linear warmup
parameters: {"warmup_steps":1}
Novel Contributions
- Model Stack-compatible runtime-row packed BitNet export
- TrainableBitNetLinear QAT modules
- Overtone spectral embedding initialization
- MLP hidden dimension 2304 under the 16MB budget
- Fused QKV with FlashAttention (see the sketch after this list)
- Parallel Muon optimization
- Dense backward pass for grad-input and grad-weight during training, used when faster than the compiled step
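A sketch of the fused-QKV attention pattern referenced above, assuming PyTorch's scaled_dot_product_attention (which can dispatch to a FlashAttention kernel); names and dimensions are illustrative, and the grouped KV heads from the GQA sketch earlier are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedQKVAttentionSketch(nn.Module):
    def __init__(self, d_model=768, num_heads=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # one fused matmul
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Dispatches to the FlashAttention backend when available.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```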