PR #69

open

SubSixteen v2: Int6 QAT + MLP 3x + SWA + Sliding Window (val_bpb 1.1708)

by TevBenji
val_bpb: 1.1708
Architecture: GPT
Optimizer: Muon
Artifact Size: 14,603,588 bytes

Training Techniques

Quantization
STE QAT
bits: 6
scope: block weights
Architecture
MLP3x
Expanded MLP hidden size to 1536 (3x expansion) and reduced depth to fit under the artifact limit.
parameters: {"layers":9,"hidden_dim":1536,"vocab_size":1024,"dim":512,"gqa_heads":8,"kv_heads":4}
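A quick sanity check of the parameter budget implied by this config. This is a sketch under assumptions the card does not state (bias-free linears, a plain two-matrix MLP rather than a gated one, and the tied embeddings listed on this card counted once), so the real count may differ somewhat.

```python
# Hypothetical parameter count for the config on this card.
# Assumes bias-free linears, a two-matrix (non-gated) MLP, and
# tied input/output embeddings counted once.
cfg = {"layers": 9, "hidden_dim": 1536, "vocab_size": 1024,
       "dim": 512, "gqa_heads": 8, "kv_heads": 4}

head_dim = cfg["dim"] // cfg["gqa_heads"]      # 64
kv_dim = cfg["kv_heads"] * head_dim            # 256 (GQA: shared K/V)

attn = 2 * cfg["dim"] * cfg["dim"] + 2 * cfg["dim"] * kv_dim  # Wq, Wo + Wk, Wv
mlp = 2 * cfg["dim"] * cfg["hidden_dim"]                      # up + down
embed = cfg["vocab_size"] * cfg["dim"]                        # tied, counted once

total = cfg["layers"] * (attn + mlp) + embed
print(total)  # 21,757,952 under these assumptions
```

At 6 bits per weight that is roughly 16.3 MB raw, which is at least consistent with a 14,603,588-byte artifact after zstd-22.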
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
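A minimal sketch of the 8-query-head / 4-KV-head grouped-query attention, not the submission's code: single sequence, no batching, no RoPE or causal mask, with each KV head broadcast to its group of query heads.

```python
import numpy as np

# Grouped-query attention sketch: 8 query heads share 4 K/V heads.
heads, kv_heads, T, head_dim = 8, 4, 16, 64
group = heads // kv_heads  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, T, head_dim))
k = rng.standard_normal((kv_heads, T, head_dim))
v = rng.standard_normal((kv_heads, T, head_dim))

# Repeat each KV head for its group of query heads.
k = np.repeat(k, group, axis=0)   # (heads, T, head_dim)
v = np.repeat(v, group, axis=0)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v                                        # (heads, T, head_dim)
```

Halving the KV heads halves K/V projection parameters and KV-cache size, which is where part of the artifact savings comes from.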
RoPE
Uses NTK-aware RoPE.
parameters: null
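A hedged sketch of NTK-aware RoPE frequency computation. The usual trick rescales the base by s**(d/(d-2)) for a context-extension factor s; the card gives no parameters, so `base=10000.0` and `scale=2.0` here are purely illustrative.

```python
import numpy as np

# NTK-aware RoPE sketch: rescale the rotary base so low frequencies
# stretch for longer contexts. `scale` is an illustrative assumption.
def ntk_rope_freqs(head_dim=64, base=10000.0, scale=2.0):
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / ntk_base ** (np.arange(0, head_dim, 2) / head_dim)

freqs = ntk_rope_freqs()
# Compare against vanilla RoPE: every frequency is lowered (except the first).
vanilla = 1.0 / 10000.0 ** (np.arange(0, 64, 2) / 64)
```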
Weight Averaging
SWA
parameters: {"checkpoints":16,"interval_steps":200}
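The SWA step here is a uniform average over 16 checkpoints saved every 200 steps. A minimal sketch with stand-in checkpoints (dicts of numpy arrays):

```python
import numpy as np

# Stochastic weight averaging sketch: uniform mean over checkpoint
# state dicts. Real checkpoints would be model state dicts on disk.
def swa_average(state_dicts):
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n
            for k in state_dicts[0]}

# Toy example: 16 "checkpoints" of a single 2x2 weight.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(16)]
avg = swa_average(ckpts)   # mean of 0..15 -> 7.5 everywhere
```

In practice a running average (updated every `interval_steps`) avoids holding all 16 checkpoints in memory at once.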
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
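The card doesn't spell out the windowing scheme; a common one gives every token up to `context_length` tokens of left context by sliding the window by `stride` and scoring only the tokens not covered by the previous window. A sketch of that scheme:

```python
# Sliding-window eval sketch: slide by `stride`, score only each
# window's new suffix, condition on up to `context` prior tokens.
def sliding_windows(n_tokens, context=4096, stride=64):
    windows = []
    for begin in range(0, n_tokens, stride):
        start = max(0, begin + stride - context)
        end = min(begin + stride, n_tokens)
        # Condition on [start, begin); score positions [begin, end).
        windows.append((start, begin, end))
    return windows

wins = sliding_windows(10_000)
```

Stride 64 against a 4096-token context is expensive (≈64 forward passes per position's worth of tokens) but gives nearly-full context to every scored token, which is what lowers val_bpb relative to chunked evaluation.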
Initialization
spectral init
Overtone SVD initialization with power-law shaping.
resid mix
Phase-transition resid_mix initialization.
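"Overtone SVD initialization with power-law shaping" is the author's own name and the card gives no details; the following is only a guess at the general shape of such a scheme (draw a Gaussian matrix, then reshape its singular-value spectrum to a power law), not the submission's method.

```python
import numpy as np

# Speculative sketch: SVD-based init with a power-law singular-value
# spectrum. `alpha` and the spectrum itself are assumptions.
def power_law_svd_init(shape, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    u, _, vt = np.linalg.svd(rng.standard_normal(shape), full_matrices=False)
    k = min(shape)
    s = np.arange(1, k + 1, dtype=float) ** -alpha  # power-law decay
    return (u * s) @ vt   # same as u @ diag(s) @ vt

w = power_law_svd_init((64, 32))
```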
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"lr":0.02,"orthogonalization":"Newton-Schulz"}
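The core of Muon is orthogonalizing the momentum/gradient matrix via Newton-Schulz iteration. Production Muon uses tuned quintic coefficients and very few iterations; the classical cubic iteration below just shows the idea (push all singular values of the normalized matrix toward 1).

```python
import numpy as np

# Classical Newton-Schulz orthogonalization sketch:
# X <- 1.5*X - 0.5*X @ X.T @ X drives singular values toward 1.
def newton_schulz_orth(g, steps=25):
    x = g / np.linalg.norm(g)   # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
u = newton_schulz_orth(rng.standard_normal((8, 4)))
# u now has (approximately) orthonormal columns: u.T @ u ~ I.
```

The iteration uses only matmuls, so it runs well in low precision on GPU, which is why Muon prefers it over an exact SVD.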
AdamW
weight_decay: null
momentum: null
other_params: {"lr_embeddings":0.03,"lr_scalars":0.02}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"momentum_warmup_steps":1500,"momentum_start":0.92,"momentum_end":0.99}
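A sketch of this schedule: constant LR with a linear warmdown over the final 3000 steps, and Muon momentum warmed up 0.92 → 0.99 over the first 1500 steps. `total_steps` is illustrative; the card does not state the run length.

```python
# Warmdown LR schedule + momentum warmup sketch.
def lr_scale(step, total_steps=10_000, warmdown_steps=3000):
    if step < total_steps - warmdown_steps:
        return 1.0                                   # flat phase
    return (total_steps - step) / warmdown_steps     # linear decay to 0

def momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)              # 0.92 -> 0.99, then flat
```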
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
Regularization
weight decay
parameters: {"decoupled":true}
Other
other
Straight-through estimator fake quantization during forward pass to improve post-training int6 robustness.
parameters: {"quant_range":[-31,31]}
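A minimal sketch of the fake-int6 forward pass described above: scale, round to the symmetric grid [-31, 31], and dequantize. For simplicity this uses one per-tensor scale, whereas the card specifies block-wise scales; in training, the straight-through estimator passes gradients through the rounding as if it were the identity.

```python
import numpy as np

QMAX = 31  # symmetric 6-bit range [-31, 31], per the card

# Fake-quant sketch (per-tensor scale for simplicity; the card uses
# per-block scales). STE backward would use the incoming gradient as-is.
def fake_quant_int6(w):
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX)
    return q * scale   # dequantized forward value

w = np.linspace(-1.0, 1.0, 7)
wq = fake_quant_int6(w)
```

Training through this rounding keeps the weights near representable int6 values, so the post-training int6 export loses little accuracy.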

Novel Contributions

  • Straight-through-estimator (STE) fake-int6 quantization-aware training
  • MLP 3x expansion enabled by int6 artifact savings
  • Stochastic Weight Averaging over 16 checkpoints
  • zstd-22 compression for the final artifact
  • Sliding window evaluation with stride 64 and context length 4096
  • Muon optimizer with Newton-Schulz orthogonalization