val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,957,281 bytes (≈15.2 MiB)
Training Techniques

Quantization
- STE QAT (bits: 5, scope: MLP)
- STE QAT (bits: 6, scope: attention and bigram-proj)
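The card ships no code, but the per-row STE fake quantization it describes can be sketched in a few lines. The function name, the symmetric rounding scheme, and the zero-row guard are assumptions; only the bit widths and per-row granularity come from the card.

```python
import numpy as np

def fake_quant_per_row(w, bits):
    # Symmetric per-row fake quantization: each weight row gets its own
    # scale, matching an export format that stores one scale per row.
    qmax = 2 ** (bits - 1) - 1                   # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized forward-pass weights
```

Under a straight-through estimator the backward pass treats the round-and-clip as identity; in an autograd framework this is typically written as `w + (fake_quant(w) - w).detach()` so gradients reach the full-precision master weights.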
Architecture
- SmearGate: learned previous-token blending at the embedding layer
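The card gives no formula for SmearGate. One plausible reading of "learned previous-token blending" is a per-channel sigmoid gate mixing each embedding with its predecessor; the parameterization below is an illustrative assumption, not the card's definition.

```python
import numpy as np

def smear_gate(x, g):
    # x: (seq, dim) token embeddings; g: (dim,) learned gate logits.
    gate = 1.0 / (1.0 + np.exp(-g))    # per-channel sigmoid gate
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # position 0 has no previous token
    return x + gate * prev             # blend in the previous token's embedding
```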
- BigramHash: hash-based bigram embedding table (dimensions: 128, table_size: 10240)
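A sketch of how a hash-based bigram table of this size might be indexed. Only dimensions=128 and table_size=10240 come from the card; the hash function and the BOS padding at position 0 are assumptions.

```python
import numpy as np

TABLE_SIZE, DIM = 10240, 128   # sizes from the card

def bigram_hash_embed(tokens, table):
    # Map each (previous, current) token pair to a row of a fixed-size
    # table; hash collisions are shared rows the model learns around.
    toks = np.asarray(tokens)
    prev = np.concatenate(([0], toks[:-1]))        # assumed BOS id 0 at position 0
    idx = (prev * 1000003 + toks) % TABLE_SIZE     # illustrative hash, not the card's
    return table[idx]                              # (seq, DIM)
```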
- MLP3x: MLP with 3x expansion (hidden_size: 1536)
- GQA: grouped-query attention with 8 attention heads and 4 KV heads (layers: 10, dim: 512)
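With 8 attention heads sharing 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal numpy sketch of that sharing (single layer, causal, names illustrative):

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (8, seq, hd) query heads; k, v: (4, seq, hd) shared KV heads.
    group = q.shape[0] // k.shape[0]             # query heads per KV head (2)
    k = np.repeat(k, group, axis=0)              # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores[:, mask] = -1e9                       # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v
```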
- Tied embeddings: input and output embeddings share one weight matrix (vocab_size: 1024)
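Weight tying means a single vocab_size × dim matrix plays both roles, saving a second 1024 × 512 matrix, which matters for a ~16 MB artifact. A minimal sketch (names and init scale are illustrative):

```python
import numpy as np

VOCAB, DIM = 1024, 512
rng = np.random.default_rng(0)
embed = rng.standard_normal((VOCAB, DIM)) * 0.02   # the single shared matrix

def embed_tokens(tokens):
    return embed[np.asarray(tokens)]   # input side: row lookup

def lm_logits(h):
    return h @ embed.T                 # output side: same weights, transposed
```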
Optimizer
- Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.02, warmup_momentum_start 0.92, warmup_steps 1500
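The warmup_momentum_start and warmup_steps entries imply a momentum schedule for Muon. A linear ramp from 0.92 to 0.99 over 1500 steps is assumed below; the card does not state the ramp shape.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Ramp momentum from warmup_momentum_start to its final value over
    # warmup_steps; the linear shape is an assumption.
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```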
- AdamW: weight_decay 0.04, scalar_lr 0.02, tied_embed_lr 0.03
Weight Averaging
- SWA (start_frac: 0.4, every_steps: 50, checkpoints_averaged: 24)
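With start_frac 0.4 and every_steps 50, checkpoints from the last 60% of training enter the average every 50 steps. A sketch of the gating condition and the incremental mean (helper names are illustrative):

```python
def swa_ready(step, total_steps, start_frac=0.4, every_steps=50):
    # A checkpoint joins the average once past start_frac of training,
    # sampled every every_steps optimizer steps.
    return step >= start_frac * total_steps and step % every_steps == 0

def swa_update(avg, new, n_seen):
    # Incremental mean: after n_seen averaged checkpoints, fold in one more.
    return [a + (p - a) / (n_seen + 1) for a, p in zip(avg, new)]
```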
Compression
- zstd (level: 22)
Evaluation
- Sliding-window evaluation (stride: 64)
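Sliding-window evaluation with stride 64 re-scores only the newest 64 tokens of each 2048-token window, so nearly every token is predicted with close-to-full context. A sketch of the window schedule (helper name and tuple layout are illustrative):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Returns (start, end, n_scored) spans. The first window scores all of
    # its tokens; each later window slides forward by `stride` and scores
    # only its newly revealed final `stride` tokens.
    first_end = min(window, n_tokens)
    spans = [(0, first_end, first_end)]
    end = first_end
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return spans
```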
Sequence Length
- train_length: 2048, eval_length: 2048
LR Schedule
- Warmdown (warmdown_iters: 3000)
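A "warmdown" schedule holds the learning rate flat and then decays it to zero over the final 3000 iterations. The linear decay shape below is the usual convention but an assumption here:

```python
def lr_multiplier(step, total_steps, warmdown_iters=3000):
    # Flat LR for most of training, then linear decay ("warmdown") to zero
    # over the final warmdown_iters steps.
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)
```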
Initialization
- OrthoInit: orthogonal weight initialization with muP output-projection scaling
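Orthogonal initialization draws weights from the orthogonal group via a QR decomposition. The muP-style shrinking of output projections is shown with a 1/sqrt(2 · n_layers) gain, which is a common convention and an assumption here; the card does not state the exact factor.

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    # Orthogonal init via QR of a Gaussian matrix (assumes rows >= cols);
    # the sign fix makes the sample uniform over the orthogonal group.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    return gain * q * np.sign(np.diag(r))

def ortho_out_proj(rows, cols, n_layers=10, seed=0):
    # muP-style output-projection scaling; the 1/sqrt(2 * n_layers) gain is
    # a common convention and an assumption, not stated by the card.
    return orthogonal_init(rows, cols, gain=1.0 / np.sqrt(2 * n_layers), seed=seed)
```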
Regularization
- Weight decay (muon_wd: 0.04, adamw_wd: 0.04)
Novel Contributions
- Mixed-precision QAT with int5 STE for MLP and int6 STE for attention/bigram projection
- STE quantization aligned exactly with the export-time per-row quantization scheme
- QAT enabled from step 0 on the full SOTA stack
- Combination of QAT with the existing SOTA architecture features such as SmearGate and BigramHash