PR #69

open

SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window (val_bpb 1.1708)

by TevBenji
val_bpb: 1.1708
Architecture: GPT
Optimizer: Muon
Artifact Size: 14,603,588 bytes

Training Techniques

Quantization
STE QAT
bits: 6
scope: block weights
Architecture
MLP3x
Expanded MLP hidden size to 1536 (3x expansion) and reduced depth to fit under the artifact limit.
parameters: {"layers":9,"hidden_dim":1536,"vocab_size":1024,"dim":512,"gqa_heads":8,"kv_heads":4}
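A quick sanity check of the parameter budget implied by this config. This is a sketch under assumptions the card does not state (bias-free linears, a plain two-matrix MLP rather than a gated one, and the tied embeddings listed on this card counted once), so the real count may differ somewhat.

```python
# Hypothetical parameter count for the config on this card.
# Assumes bias-free linears, a two-matrix (non-gated) MLP, and
# tied input/output embeddings counted once.
cfg = {"layers": 9, "hidden_dim": 1536, "vocab_size": 1024,
       "dim": 512, "gqa_heads": 8, "kv_heads": 4}

head_dim = cfg["dim"] // cfg["gqa_heads"]      # 64
kv_dim = cfg["kv_heads"] * head_dim            # 256 (GQA: shared K/V)

attn = 2 * cfg["dim"] * cfg["dim"] + 2 * cfg["dim"] * kv_dim  # Wq, Wo + Wk, Wv
mlp = 2 * cfg["dim"] * cfg["hidden_dim"]                      # up + down
embed = cfg["vocab_size"] * cfg["dim"]                        # tied, counted once

total = cfg["layers"] * (attn + mlp) + embed
print(total)  # 21,757,952 under these assumptions
```

At 6 bits per weight that is roughly 16.3 MB raw, which is at least consistent with a 14,603,588-byte artifact after zstd-22.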
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
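A minimal sketch of the 8-query-head / 4-KV-head grouped-query attention, not the submission's code: single sequence, no batching, no RoPE or causal mask, with each KV head broadcast to its group of query heads.

```python
import numpy as np

# Grouped-query attention sketch: 8 query heads share 4 K/V heads.
heads, kv_heads, T, head_dim = 8, 4, 16, 64
group = heads // kv_heads  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, T, head_dim))
k = rng.standard_normal((kv_heads, T, head_dim))
v = rng.standard_normal((kv_heads, T, head_dim))

# Repeat each KV head for its group of query heads.
k = np.repeat(k, group, axis=0)   # (heads, T, head_dim)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v                                        # (heads, T, head_dim)
```

Halving the KV heads halves K/V projection parameters and KV-cache size, which is where part of the artifact savings comes from.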
RoPE
Uses NTK-aware RoPE.
parameters: null
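A hedged sketch of NTK-aware RoPE frequency computation. The usual trick rescales the base by s**(d/(d-2)) for a context-extension factor s; the card gives no parameters, so `base=10000.0` and `scale=2.0` here are purely illustrative.

```python
import numpy as np

# NTK-aware RoPE sketch: rescale the rotary base so low frequencies
# stretch for longer contexts. `scale` is an illustrative assumption.
def ntk_rope_freqs(head_dim=64, base=10000.0, scale=2.0):
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / ntk_base ** (np.arange(0, head_dim, 2) / head_dim)

freqs = ntk_rope_freqs()
# Compare against vanilla RoPE: every frequency is lowered (except the first).
vanilla = 1.0 / 10000.0 ** (np.arange(0, 64, 2) / 64)
```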
Weight Averaging
SWA
parameters: {"checkpoints":16,"interval_steps":200}
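The SWA step here is a uniform average over 16 checkpoints saved every 200 steps. A minimal sketch with stand-in checkpoints (dicts of numpy arrays):

```python
import numpy as np

# Stochastic weight averaging sketch: uniform mean over checkpoint
# state dicts. Real checkpoints would be model state dicts on disk.
def swa_average(state_dicts):
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n
            for k in state_dicts[0]}

# Toy example: 16 "checkpoints" of a single 2x2 weight.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(16)]
avg = swa_average(ckpts)   # mean of 0..15 -> 7.5 everywhere
```

In practice a running average (updated every `interval_steps`) avoids holding all 16 checkpoints in memory at once.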
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
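The card doesn't spell out the windowing scheme; a common one gives every token up to `context_length` tokens of left context by sliding the window by `stride` and scoring only the tokens not covered by the previous window. A sketch of that scheme:

```python
# Sliding-window eval sketch: slide by `stride`, score only each
# window's new suffix, condition on up to `context` prior tokens.
def sliding_windows(n_tokens, context=4096, stride=64):
    windows = []
    for begin in range(0, n_tokens, stride):
        start = max(0, begin + stride - context)
        end = min(begin + stride, n_tokens)
        # Condition on [start, begin); score positions [begin, end).
        windows.append((start, begin, end))
    return windows

wins = sliding_windows(10_000)
```

Stride 64 against a 4096-token context is expensive (≈64 forward passes per position's worth of tokens) but gives nearly-full context to every scored token, which is what lowers val_bpb relative to chunked evaluation.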
Initialization
spectral init
Overtone SVD initialization with power-law shaping.
resid mix
Phase-transition resid_mix initialization.
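"Overtone SVD initialization with power-law shaping" is the author's own name and the card gives no details; the following is only a guess at the general shape of such a scheme (draw a Gaussian matrix, then reshape its singular-value spectrum to a power law), not the submission's method.

```python
import numpy as np

# Speculative sketch: SVD-based init with a power-law singular-value
# spectrum. `alpha` and the spectrum itself are assumptions.
def power_law_svd_init(shape, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    u, _, vt = np.linalg.svd(rng.standard_normal(shape), full_matrices=False)
    k = min(shape)
    s = np.arange(1, k + 1, dtype=float) ** -alpha  # power-law decay
    return (u * s) @ vt   # same as u @ diag(s) @ vt

w = power_law_svd_init((64, 32))
```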
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"lr":0.02,"orthogonalization":"Newton-Schulz"}
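The core of Muon is orthogonalizing the momentum/gradient matrix via Newton-Schulz iteration. Production Muon uses tuned quintic coefficients and very few iterations; the classical cubic iteration below just shows the idea (push all singular values of the normalized matrix toward 1).

```python
import numpy as np

# Classical Newton-Schulz orthogonalization sketch:
# X <- 1.5*X - 0.5*X @ X.T @ X drives singular values toward 1.
def newton_schulz_orth(g, steps=25):
    x = g / np.linalg.norm(g)   # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
u = newton_schulz_orth(rng.standard_normal((8, 4)))
# u now has (approximately) orthonormal columns: u.T @ u ~ I.
```

The iteration uses only matmuls, so it runs well in low precision on GPU, which is why Muon prefers it over an exact SVD.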
AdamW
weight_decay: null
momentum: null
other_params: {"lr_embeddings":0.03,"lr_scalars":0.02}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"momentum_warmup_steps":1500,"momentum_start":0.92,"momentum_end":0.99}
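A sketch of this schedule: constant LR with a linear warmdown over the final 3000 steps, and Muon momentum warmed up 0.92 → 0.99 over the first 1500 steps. `total_steps` is illustrative; the card does not state the run length.

```python
# Warmdown LR schedule + momentum warmup sketch.
def lr_scale(step, total_steps=10_000, warmdown_steps=3000):
    if step < total_steps - warmdown_steps:
        return 1.0                                   # flat phase
    return (total_steps - step) / warmdown_steps     # linear decay to 0

def momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)              # 0.92 -> 0.99, then flat
```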
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Regularization
weight decay
parameters: {"decoupled":true}
Other
other
Straight-through estimator fake quantization during forward pass to improve post-training int6 robustness.
parameters: {"quant_range":[-31,31]}
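A minimal sketch of the fake-int6 forward pass described above: scale, round to the symmetric grid [-31, 31], and dequantize. For simplicity this uses one per-tensor scale, whereas the card specifies block-wise scales; in training, the straight-through estimator passes gradients through the rounding as if it were the identity.

```python
import numpy as np

QMAX = 31  # symmetric 6-bit range [-31, 31], per the card

# Fake-quant sketch (per-tensor scale for simplicity; the card uses
# per-block scales). STE backward would use the incoming gradient as-is.
def fake_quant_int6(w):
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX)
    return q * scale   # dequantized forward value

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant_int6(w)
```

Training through this rounding keeps the weights near representable int6 values, so the post-training int6 export loses little accuracy.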

Novel Contributions

  • Straight-through-estimator (STE) fake-int6 quantization-aware training
  • MLP 3x expansion enabled by int6 artifact savings
  • Stochastic Weight Averaging over 16 checkpoints
  • zstd-22 compression for the final artifact
  • Sliding window evaluation with stride 64 and context length 4096
  • Muon optimizer with Newton-Schulz orthogonalization