PR #469 (closed)
Non-record: 27M params at Int5 QAT / train larger, quantize harder (val_bpb=1.1418)
by cmcdnd
val_bpb: 1.1418
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.7 MB
Training Techniques
Quantization
int5
bits: 5
scope: MLP and attention weights
QAT
bits: 5
scope: all
Architecture
Partial RoPE
Applies rotary position embeddings to only 16 of the 64 head dimensions
parameters: {"dimensions":"16/64"}
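A sketch of the partial RoPE above, rotating only the first 16 of 64 head dimensions. Which 16 dims are rotated and the frequency base are assumptions:

```python
import torch

def partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to the first `rot_dims` of the head
    dimension, leaving the remaining dims untouched.
    x: (batch, seq, heads, head_dim), head_dim >= rot_dims, rot_dims even."""
    b, t, h, d = x.shape
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs   # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```

The untouched 48 dims keep position-independent content channels, which is the usual motivation for partial rotation.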
XSA
Applies XSA in the last 4 layers
parameters: {"layers":4}
SmearGate
Uses SmearGate activation/module
parameters: null
BigramHash
Adds BigramHash feature module
parameters: {"size":4096,"dim":128}
MLP3x
Uses 3x MLP expansion
parameters: {"hidden":1728}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":9,"kv_heads":3}
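The grouped-query attention above (9 query heads sharing 3 KV heads) amounts to expanding each KV head across its group of 3 query heads before the attention product. A minimal sketch, with a (batch, seq, heads, head_dim) layout assumed:

```python
import torch

def expand_kv(kv: torch.Tensor, n_heads: int = 9) -> torch.Tensor:
    """Grouped-query attention KV expansion: each KV head is shared by a
    contiguous group of query heads (9 / 3 = groups of 3 here).
    kv: (batch, seq, kv_heads, head_dim) -> (batch, seq, n_heads, head_dim)."""
    kv_heads = kv.shape[2]
    assert n_heads % kv_heads == 0, "query heads must be a multiple of KV heads"
    return kv.repeat_interleave(n_heads // kv_heads, dim=2)
```

Only the 3 KV heads are stored and trained; the expansion is a view-level broadcast, which is where the parameter and KV-cache savings come from.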
U-Net skips
Uses U-Net style skip connections
parameters: null
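The exact wiring of the U-Net skips is not specified in the card; a common pattern is to save activations from the first half of the layer stack and add them back at the mirrored layers of the second half. A minimal sketch under that assumption:

```python
def unet_forward(x, layers):
    """U-Net style skips over a layer stack: inputs to the first half of
    the layers are pushed onto a stack and added back (LIFO, so mirrored)
    before each layer in the second half."""
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            skips.append(x)        # save pre-layer activation
        elif skips:
            x = x + skips.pop()    # add mirrored early activation
        x = layer(x)
    return x
```

The same function works on tensors or plain numbers, since it only uses `+`.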
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
Weight Averaging
SWA
parameters: null
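SWA here presumably averages parameter snapshots from late in training; the averaging window and cadence are not specified in the card. A minimal running-mean sketch:

```python
import torch

class SWAAverager:
    """Stochastic weight averaging as a running mean over the parameter
    snapshots fed in: avg_n = avg_{n-1} + (w_n - avg_{n-1}) / n."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, state_dict):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.detach().clone().float() for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                self.avg[k] += (v.detach().float() - self.avg[k]) / self.n
```

The averaged weights would replace the final checkpoint before quantization and export.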
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
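Sliding-window eval with stride 64 typically scores each token with near-maximal left context: a full-length window advances 64 tokens at a time and only the newly covered tokens are scored. A sketch of the window bookkeeping; tying the window length to the 2048 train length is an assumption:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (begin, end, n_scored) spans for sliding-window evaluation.
    Each span is a model forward pass over tokens [begin, end); only the
    last n_scored tokens (those not covered by a previous span) count
    toward the loss, so every token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Averaging the summed per-token losses over `n_tokens` then gives the reported bits-per-byte.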
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
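The warmdown schedule is presumably the trapezoidal "constant, then linear decay to zero" shape common in speedrun-style training; a sketch returning the LR multiplier, with any warmup phase omitted:

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3500) -> float:
    """Warmdown LR multiplier: 1.0 until the final warmdown_steps, then a
    linear ramp down to 0.0 at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```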
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Early activation of int5 STE fake-quantization when lr_scale < 0.50, giving about 1,700 adaptation steps
parameters: {"threshold":0.5,"adaptation_steps":1700}
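Under a linear warmdown, the lr_scale < 0.50 trigger fires 0.5 x 3500 = 1750 steps before the end of training, consistent with the roughly 1,700 adaptation steps reported. A sketch of that gating logic, with the schedule shape assumed as above:

```python
def qat_active(step: int, total_steps: int,
               warmdown_steps: int = 3500, threshold: float = 0.5) -> bool:
    """True once the warmdown LR scale has dropped below `threshold`,
    i.e. int5 fake quantization is switched on for the rest of training."""
    remaining = total_steps - step
    return min(1.0, remaining / warmdown_steps) < threshold
```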
Novel Contributions
- Train a larger 27M-parameter model at the same artifact budget by using more aggressive int5 quantization instead of int6.
- Activate QAT much earlier (threshold 0.50) to allow substantially more adaptation time for the coarser 32-level quantization grid.
- Demonstrate that training larger and quantizing harder can outperform the standard smaller int6 approach at similar artifact size.