PR #137

Status: open

Record: Int6 + MLP 3x + STE QAT + NorMuon + sliding window (val_bpb 1.1666)

by abhishekgahlot2
val_bpb: 1.1666
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.22 MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: MLP and attention weights; fp16 passthrough for tied embedding
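A minimal sketch of the per-row symmetric int6 fake-quantization this scope describes. The exact rounding and scale conventions are assumptions; the key QAT idea is that the forward pass sees quantized weights while the straight-through estimator (STE) treats round() as identity in the backward pass.

```python
import numpy as np

def fake_quant_int6_per_row(w):
    """Fake-quantize a weight matrix to int6 with one scale per row.

    Forward pass: scale each row so its max magnitude maps onto the
    symmetric int6 grid, round, clip, and dequantize. Under QAT, the
    straight-through estimator passes gradients through round()
    unchanged, so the model learns weights that survive quantization.
    """
    qmax = 31  # symmetric int6 range used here: [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale  # dequantized weights with quantization error baked in
```

Per the scope above, this would apply to MLP and attention weights only; the tied embedding stays in fp16 and bypasses this path.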
Architecture
MLP3x
Expanded MLP hidden size to 1536 (3x expansion) to increase model capacity.
parameters: {"hidden":1536,"mlp_mult":3}
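A sketch of the expanded block, assuming a model width of 512 so that the 3x multiplier yields the listed hidden size of 1536; the activation (ReLU) and initialization are illustrative assumptions, not taken from the PR.

```python
import numpy as np

D_MODEL = 512          # assumed model width; 3x expansion gives the listed 1536
HIDDEN = 3 * D_MODEL   # matches parameters {"hidden":1536,"mlp_mult":3}

def mlp_forward(x, w_in, w_out):
    """Two-layer MLP block with 3x hidden expansion (ReLU assumed)."""
    return np.maximum(x @ w_in, 0.0) @ w_out

rng = np.random.default_rng(0)
w_in = rng.normal(0, 0.02, (D_MODEL, HIDDEN))
w_out = rng.normal(0, 0.02, (HIDDEN, D_MODEL))
x = rng.normal(0, 1.0, (4, D_MODEL))  # batch of 4 token vectors
y = mlp_forward(x, w_in, w_out)       # shape (4, 512)
```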
Optimizer
NorMuon
weight_decay: 0.01
momentum: 0.99
other_params: {"matrix_lr":0.02,"grad_clip":0.3,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
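A hedged sketch of the NorMuon-style update: orthogonalize the momentum via Newton-Schulz, then normalize each row's RMS (the "row-wise RMS normalization" named in the contributions). The quintic coefficients are those of the public Muon implementation; NorMuon's accumulated second-moment statistics are simplified here to a per-step row RMS, and momentum warmup and weight decay are omitted.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes
    the update matrix (coefficients from the public Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:            # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_step(momentum, matrix_lr=0.02):
    """NorMuon-style update sketch: orthogonalize the momentum buffer,
    then rescale each output row to unit RMS before applying the LR."""
    O = newton_schulz(momentum)
    row_rms = np.sqrt(np.mean(O * O, axis=1, keepdims=True)) + 1e-7
    return -matrix_lr * O / row_rms
```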
Weight Averaging
SWA
parameters: {"checkpoint_interval_steps":200,"warmdown_iters":3000}
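The SWA step reduces to a uniform average over snapshots; a minimal sketch, assuming parameters are stored as dicts of arrays:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average a list of parameter dicts (SWA).

    With checkpoint_interval_steps=200 over warmdown_iters=3000, this
    would average up to 15 snapshots taken during the warmdown phase.
    """
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```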
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
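One plausible reading of stride-64 sliding-window eval, sketched as a span planner: each window scores only its last `stride` tokens, so every token past the first window is evaluated with near-full left context rather than the short context of disjoint chunks.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Plan (context_start, end, score_from) spans for sliding-window eval.

    Only tokens in [score_from, end) are scored in each window; earlier
    tokens serve purely as context. Every token is scored exactly once,
    which typically lowers val_bpb versus non-overlapping chunks.
    """
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        context_start = max(0, end - window)
        spans.append((context_start, end, pos))
        pos = end
    return spans
```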
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
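A warmdown schedule of this shape is typically a constant LR followed by a linear decay to zero over the final steps; a sketch under that assumption, using the listed matrix_lr=0.02 and warmdown_steps=3000 (total step count is hypothetical):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then linear warmdown to zero over the last steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```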
Regularization
weight decay
parameters: {"weight_decay":0.01,"grad_clip_norm":0.3,"logit_softcap":15}
Other
other
Mixed quantization with int6 per-row on MLP and attention weights, fp16 passthrough for tied embedding, and QAT using a straight-through estimator.
parameters: {"enable_qat":1,"ema_decay":0.998}
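The listing does not say what the EMA with decay 0.998 smooths; one plausible reading is EMA-smoothed per-row quantization scales during QAT, sketched here as an assumption.

```python
def ema(prev, new, decay=0.998):
    """Exponential moving average with the listed ema_decay=0.998.
    Applied per step, e.g. to per-row quantization scales (assumed)."""
    return decay * prev + (1.0 - decay) * new
```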

Novel Contributions

  • Int6 mixed quantization with STE fake-int6 QAT
  • 3x MLP expansion to increase capacity under artifact size constraints
  • NorMuon optimizer with row-wise RMS normalization after Newton-Schulz orthogonalization
  • SWA checkpoint averaging during warmdown
  • Sliding window evaluation with stride 64 for improved val_bpb
  • Mixed quantization scheme with int6 per-row weights and fp16 tied embedding passthrough