PR #414

closed

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)

by signalrush
val_bpb: 1.1233
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.55 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: MLP and attention weights
QAT
bits: 6
scope: model weights
int8
bits: 8
scope: embeddings
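The GPTQ-lite entry's per-row clip search can be sketched as below; the percentile grid, symmetric int6 scheme, and MSE objective are assumptions for illustration, not the PR's exact code:

```python
import numpy as np

def quantize_rows_int6(w, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Symmetric per-row int6 quantization with a per-row clip search.

    For each row, try several clip percentiles of |w| and keep the one
    with the lowest reconstruction MSE (GPTQ-lite-style sketch).
    """
    qmax = 31  # symmetric int6 range: [-31, 31]
    out = np.empty_like(w)
    for i, row in enumerate(w):
        best_err, best_rec = np.inf, row
        for p in percentiles:
            clip = np.percentile(np.abs(row), p)
            if clip == 0:
                continue
            scale = clip / qmax
            q = np.clip(np.round(row / scale), -qmax, qmax)
            rec = q * scale
            err = np.mean((row - rec) ** 2)
            if err < best_err:
                best_err, best_rec = err, rec
        out[i] = best_rec
    return out
```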
Architecture
MLP3x
3x MLP expansion with relu-squared activation
parameters: {"expansion":3}
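A minimal forward pass for the MLP3x block, assuming bias-free projections and the listed 3x expansion:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """3x-expanded MLP with relu-squared activation (bias-free sketch).

    x: (n, d), w_in: (d, 3d), w_out: (3d, d)
    """
    h = np.maximum(x @ w_in, 0.0) ** 2  # relu(x)^2
    return h @ w_out
```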
XSA
Efficient Partial XSA on the last 4 layers, GQA-aware and zero-alloc
parameters: {"layers":4}
Partial RoPE
Partial rotary positional embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
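The 16/64 split means rotary embeddings touch only the first 16 of 64 head dimensions; the rest pass through unrotated. A sketch, where the NTK base-stretching exponent is an assumption:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0, ntk_factor=1.0):
    """Apply rotary embeddings to the first `rot_dims` of each head
    dimension, leaving the remainder untouched (the 16/64 split from
    the PR). NTK-aware scaling stretches the base frequency; the exact
    exponent used here is an assumption.

    x: (seq, head_dim)
    """
    seq, head_dim = x.shape
    half = rot_dims // 2
    base = base * ntk_factor ** (rot_dims / (rot_dims - 2))
    inv_freq = 1.0 / base ** (np.arange(half) / half)
    t = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(t), np.sin(t)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```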
SmearGate
Custom gating mechanism used in the model
parameters: null
BigramHash
Bigram hashing feature with 2048 buckets and dim 128
parameters: {"buckets":2048,"dim":128}
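The bigram feature hashes each (previous token, current token) pair into one of 2048 buckets and looks up a 128-dim embedding. A sketch; the mixing constants and BOS handling are illustrative, not the PR's actual hash:

```python
import numpy as np

def bigram_features(tokens, table, buckets=2048):
    """Look up a hashed-bigram embedding per position.

    tokens: list of int token ids; table: (buckets, dim) embedding matrix.
    """
    feats = []
    prev = 0  # assumed BOS id for the first position
    for tok in tokens:
        h = ((prev * 1000003) ^ tok) * 2654435761 % (1 << 32)
        feats.append(table[h % buckets])
        prev = tok
    return np.stack(feats)
```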
tied embeddings
Input and output embeddings are tied
parameters: null
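With tied embeddings, the LM head reuses the input embedding matrix instead of a separate projection, which directly shrinks the compressed artifact:

```python
import numpy as np

def tied_logits(h, emb):
    """Tied-embedding output head: logits = h @ E^T, so the artifact
    stores a single table for input and output (sketch).

    h: (n, d) hidden states; emb: (vocab, d) embedding table.
    """
    return h @ emb.T
```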
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"warmup_momentum":"0.92->0.99 over 1500 steps"}
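The listed momentum warmup (0.92 -> 0.99 over 1500 steps) can be sketched as a schedule function; linear interpolation is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup 0.92 -> 0.99 over 1500 steps, per the Muon
    params above (the linear shape is an assumption)."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```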
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr":0.035,"scope":"embeddings"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025,"scope":"scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997,"every_step":true}
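The every-step EMA with decay 0.997 is a one-line update per parameter (dict-of-floats sketch):

```python
def ema_update(ema, params, decay=0.997):
    """One per-step EMA update with decay 0.997:
    ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema[k] + (1.0 - decay) * p for k, p in params.items()}
```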
SWA
parameters: {"frequency":50,"start_condition":"scale<0.2","tight":true}
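The SWA entry reads as: start a uniform running average once the LR scale drops below 0.2, updating every 50 steps. A dict-of-floats sketch; the "tight" flag is not modeled:

```python
class SWA:
    """Uniform weight average updated every `frequency` steps once the
    LR scale drops below `start_scale`, per the listed parameters."""

    def __init__(self, frequency=50, start_scale=0.2):
        self.frequency = frequency
        self.start_scale = start_scale
        self.avg = None
        self.count = 0

    def maybe_update(self, step, lr_scale, params):
        if lr_scale >= self.start_scale or step % self.frequency != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            # incremental running mean: avg += (x - avg) / count
            self.avg = {k: v + (params[k] - v) / self.count
                        for k, v in self.avg.items()}
```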
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
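Stride-64 sliding-window eval re-reads full context for every window but only scores the final stride's worth of new tokens. A sketch of the window bookkeeping; the window size of 256 is an assumption, only the stride comes from the PR:

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (window_start, score_offset) pairs: the first window scores
    all its tokens, each later window scores only its last `stride`."""
    start = 0
    while start + window <= n_tokens:
        score_from = 0 if start == 0 else window - stride
        yield start, score_from
        start += stride
```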
LR Schedule
warmdown3500
parameters: {"warmdown_steps":3500}
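The warmdown3500 schedule can be sketched as constant LR followed by a decay to zero over the final 3500 steps; the linear shape is an assumption, and base_lr is taken from the Muon entry above:

```python
def lr_at(step, total_steps, base_lr=0.025, warmdown_steps=3500):
    """Constant LR, then linear warmdown to zero over the last
    `warmdown_steps` steps of training."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```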
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
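The listed formula scales each layer's LN output by 1/sqrt(layer_idx+1), damping deeper layers' residual-stream contributions:

```python
import math

def ln_scale(layer_idx):
    """Layerwise LN scale 1/sqrt(layer_idx + 1), as listed in the PR
    (0-indexed layers assumed)."""
    return 1.0 / math.sqrt(layer_idx + 1)
```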
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections
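Orthogonal init is typically done via QR of a Gaussian matrix; a sketch, where the muP-scaled output-projection multiplier is an assumption left at 1:

```python
import numpy as np

def ortho_init(d_in, d_out, gain=1.0, seed=0):
    """Orthogonal init via reduced QR of a Gaussian matrix
    (requires d_in >= d_out). The PR's muP-scaled output projections
    would shrink `gain` with width; the exact rule is an assumption."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d_in, d_out)))
    q = q * np.sign(np.diag(r))  # sign fix for a uniform orthogonal frame
    return gain * q
```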

Novel Contributions

  • GPTQ-lite per-row optimal clip percentile search for int6 quantization
  • EMA weight averaging applied every training step before quantization
  • Longer warmdown schedule (3500 iterations) than the prior submission
  • Higher late QAT threshold (0.15) to reduce quantization gap
  • Combined post-training optimization and training hyperparameter tuning to achieve a new record
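The QAT side of the recipe amounts to fake-quantizing weights in the forward pass so training adapts to the int6 grid. A sketch; the straight-through gradient and how the 0.15 late-QAT threshold gates this are not modeled:

```python
import numpy as np

def fake_quant(w, bits=6):
    """Fake-quantize to symmetric int6 in the forward pass (QAT sketch).
    In training, gradients would pass straight through the rounding
    (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    m = np.max(np.abs(w))
    scale = m / qmax if m > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale
```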