PR #137

Status: open

Record: Int6 + MLP 3x + STE QAT + NorMuon + sliding window (val_bpb 1.1666)

by abhishekgahlot2
val_bpb: 1.1666
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.22 MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: MLP and attention weights; fp16 passthrough for tied embedding
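A minimal sketch of the per-row symmetric int6 fake-quantization this scope describes. The exact rounding and scale conventions are assumptions; the key QAT idea is that the forward pass sees quantized weights while the straight-through estimator (STE) treats round() as identity in the backward pass.

```python
import numpy as np

def fake_quant_int6_per_row(w):
    """Fake-quantize a weight matrix to int6 with one scale per row.

    Forward pass: scale each row so its max magnitude maps onto the
    symmetric int6 grid, round, clip, and dequantize. Under QAT, the
    straight-through estimator passes gradients through round()
    unchanged, so the model learns weights that survive quantization.
    """
    qmax = 31  # symmetric int6 range used here: [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale  # dequantized weights with quantization error baked in
```

Per the scope above, this would apply to MLP and attention weights only; the tied embedding stays in fp16 and bypasses this path.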
Architecture
MLP3x
Expanded MLP hidden size to 1536 (3x expansion) to increase model capacity.
parameters: {"hidden":1536,"mlp_mult":3}
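A sketch of the expanded block, assuming a model width of 512 so that the 3x multiplier yields the listed hidden size of 1536; the activation (ReLU) and initialization are illustrative assumptions, not taken from the PR.

```python
import numpy as np

D_MODEL = 512          # assumed model width; 3x expansion gives the listed 1536
HIDDEN = 3 * D_MODEL   # matches parameters {"hidden":1536,"mlp_mult":3}

def mlp_forward(x, w_in, w_out):
    """Two-layer MLP block with 3x hidden expansion (ReLU assumed)."""
    return np.maximum(x @ w_in, 0.0) @ w_out

rng = np.random.default_rng(0)
w_in = rng.normal(0, 0.02, (D_MODEL, HIDDEN))
w_out = rng.normal(0, 0.02, (HIDDEN, D_MODEL))
x = rng.normal(0, 1.0, (4, D_MODEL))  # batch of 4 token vectors
y = mlp_forward(x, w_in, w_out)       # shape (4, 512)
```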
Optimizer
NorMuon
weight_decay: 0.01
momentum: 0.99
other_params: {"matrix_lr":0.02,"grad_clip":0.3,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
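A hedged sketch of the NorMuon-style update: orthogonalize the momentum via Newton-Schulz, then normalize each row's RMS (the "row-wise RMS normalization" named in the contributions). The quintic coefficients are those of the public Muon implementation; NorMuon's accumulated second-moment statistics are simplified here to a per-step row RMS, and momentum warmup and weight decay are omitted.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes
    the update matrix (coefficients from the public Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:            # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_step(momentum, matrix_lr=0.02):
    """NorMuon-style update sketch: orthogonalize the momentum buffer,
    then rescale each output row to unit RMS before applying the LR."""
    O = newton_schulz(momentum)
    row_rms = np.sqrt(np.mean(O * O, axis=1, keepdims=True)) + 1e-7
    return -matrix_lr * O / row_rms
```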
Weight Averaging
SWA
parameters: {"checkpoint_interval_steps":200,"warmdown_iters":3000}
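The SWA step reduces to a uniform average over snapshots; a minimal sketch, assuming parameters are stored as dicts of arrays:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average a list of parameter dicts (SWA).

    With checkpoint_interval_steps=200 over warmdown_iters=3000, this
    would average up to 15 snapshots taken during the warmdown phase.
    """
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```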
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
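One plausible reading of stride-64 sliding-window eval, sketched as a span planner: each window scores only its last `stride` tokens, so every token past the first window is evaluated with near-full left context rather than the short context of disjoint chunks.

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Plan (context_start, end, score_from) spans for sliding-window eval.

    Only tokens in [score_from, end) are scored in each window; earlier
    tokens serve purely as context. Every token is scored exactly once,
    which typically lowers val_bpb versus non-overlapping chunks.
    """
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        context_start = max(0, end - window)
        spans.append((context_start, end, pos))
        pos = end
    return spans
```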
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
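A warmdown schedule of this shape is typically a constant LR followed by a linear decay to zero over the final steps; a sketch under that assumption, using the listed matrix_lr=0.02 and warmdown_steps=3000 (total step count is hypothetical):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then linear warmdown to zero over the last steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```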
Regularization
weight decay
parameters: {"weight_decay":0.01,"grad_clip_norm":0.3,"logit_softcap":15}
Other
other
Mixed quantization with int6 per-row on MLP and attention weights, fp16 passthrough for tied embedding, and QAT using a straight-through estimator.
parameters: {"enable_qat":1,"ema_decay":0.998}
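The listing does not say what the EMA with decay 0.998 smooths; one plausible reading is EMA-smoothed per-row quantization scales during QAT, sketched here as an assumption.

```python
def ema(prev, new, decay=0.998):
    """Exponential moving average with the listed ema_decay=0.998.
    Applied per step, e.g. to per-row quantization scales (assumed)."""
    return decay * prev + (1.0 - decay) * new
```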

Novel Contributions

  • Int6 mixed quantization with STE fake-int6 QAT
  • 3x MLP expansion to increase capacity under artifact size constraints
  • NorMuon optimizer with row-wise RMS normalization after Newton-Schulz orthogonalization
  • SWA checkpoint averaging during warmdown
  • Sliding window evaluation with stride 64 for improved val_bpb
  • Mixed quantization scheme with int6 per-row weights and fp16 tied embedding passthrough