PR #116
Record: Int6 + MLP 3x + STE QAT + NorMuon + sliding window (val_bpb 1.1666)
Status: closed
by abhishekgahlot2
val_bpb: 1.1666
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.22 MB
Training Techniques
Quantization: STE QAT
bits: 6
scope: MLP and attention weights; fp16 passthrough for tied embedding and small/control tensors
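A minimal sketch of per-row STE fake quantization at 6 bits, as described above. The function name and the per-row absmax scaling scheme are illustrative assumptions, not taken from the PR's code:

```python
def fake_quant_int6_row(row, bits=6):
    """Fake-quantize one weight row to signed int6 levels with a per-row
    absmax scale. During training, the backward pass treats round() as
    identity (the straight-through estimator), so gradients still flow to
    the full-precision weights."""
    qmax = 2 ** (bits - 1) - 1                      # 31 magnitude levels for int6
    scale = max(abs(x) for x in row) / qmax or 1.0  # guard all-zero rows
    # snap to the int grid, then dequantize back to floats
    return [round(x / scale) * scale for x in row]
```

Quantize-dequantize in the forward pass like this lets the model adapt to int6 precision throughout training, so the final export to real int6 weights costs little accuracy.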
Architecture: MLP3x
Expanded MLP hidden size to 1536 (3x expansion) to increase model capacity.
parameters: {"hidden_size":1536,"mlp_mult":3}
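The resulting weight shapes can be sketched as follows. Note that d_model = 512 is inferred from hidden_size = 1536 with mlp_mult = 3 and is an assumption, not stated in the PR:

```python
def mlp_shapes(d_model=512, mlp_mult=3):
    """Weight shapes for the expanded MLP. hidden = d_model * mlp_mult = 1536.
    d_model=512 is an inferred assumption from the stated hidden size."""
    hidden = d_model * mlp_mult
    return {"fc_in": (d_model, hidden), "fc_out": (hidden, d_model)}
```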
Optimizer: NorMuon
weight_decay: 0.01
momentum: 0.99
other_params: {"matrix_lr":0.02,"grad_clip":0.3,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
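The two ingredients named in the contributions list, Newton-Schulz orthogonalization followed by row-wise RMS normalization, can be sketched in pure Python. This is a simplified illustration (the quintic coefficients are the ones published for Muon; the actual NorMuon kernel is an assumption here):

```python
import math

def _matmul(a, b):
    """Naive list-of-lists matrix multiply (illustration only)."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def newton_schulz(g, steps=5):
    """Quintic Newton-Schulz iteration (the Muon orthogonalization step):
    pushes the singular values of the update matrix toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / fro for v in row] for row in g]   # normalize so the iteration converges
    for _ in range(steps):
        xt = [list(r) for r in zip(*x)]
        xxt = _matmul(x, xt)
        bx = _matmul(xxt, x)
        cx = _matmul(xxt, bx)
        x = [[a * xi + b * bi + c * ci for xi, bi, ci in zip(rx, rb, rc)]
             for rx, rb, rc in zip(x, bx, cx)]
    return x

def rowwise_rms_normalize(update):
    """NorMuon's addition: rescale each row of the orthogonalized update so
    its RMS is 1, equalizing per-row update magnitudes."""
    out = []
    for row in update:
        rms = math.sqrt(sum(v * v for v in row) / len(row)) or 1.0
        out.append([v / rms for v in row])
    return out
```

The normalized update would then be applied with matrix_lr = 0.02 and the momentum schedule listed above.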
Weight Averaging: SWA
parameters: {"checkpoint_interval_steps":200,"warmdown_iters":3000}
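The SWA step reduces to a uniform average over the checkpoints collected during warmdown (every 200 steps over the 3000-step warmdown). A minimal sketch, assuming a dict-of-lists checkpoint format that is purely illustrative:

```python
def swa_average(checkpoints):
    """Uniformly average parameter snapshots (hypothetical format:
    dict of name -> list of floats), as in stochastic weight averaging."""
    n = len(checkpoints)
    return {k: [sum(c[k][i] for c in checkpoints) / n
                for i in range(len(checkpoints[0][k]))]
            for k in checkpoints[0]}
```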
Compression: zstd
level: 22
Evaluation: sliding window eval
parameters: {"stride":64}
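The eval loop structure with stride 64 can be sketched as below: advance the window by the stride and score only the newly entered tokens, so every scored token sees nearly a full window of left context. `nll_fn` is a hypothetical stand-in for a model forward pass returning per-token negative log-likelihoods in nats:

```python
import math

def sliding_window_eval(tokens, nll_fn, window=2048, stride=64):
    """Sliding-window evaluation: each step advances by `stride` and scores
    only the newly-entered tokens. Returns bits per token; dividing by the
    average bytes per token would give bits per byte (bpb)."""
    total_nats, n_scored = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + stride, len(tokens))
        ctx = tokens[max(0, end - window):end]   # up to `window` of left context
        nlls = nll_fn(ctx)
        take = end - begin                       # only the newly-entered tokens
        total_nats += sum(nlls[-take:])
        n_scored += take
    return total_nats / n_scored / math.log(2)   # nats -> bits
```

The small stride makes evaluation more expensive than a non-overlapping split, but each token's loss is measured with much more context, which lowers the measured bpb.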
Sequence Length
train_length: 2048
eval_length: 2048
LR Schedule: warmdown
parameters: {"warmdown_steps":3000}
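A warmdown schedule of this kind is typically constant, then linearly decayed to zero over the final steps. The exact shape is an assumption; only warmdown_steps = 3000 is stated:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    """Assumed warmdown shape: hold base_lr, then decay linearly to 0
    over the final `warmdown_steps` steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```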
Regularization: weight decay
parameters: {"value":0.01}
Other
Logit softcap applied during training/evaluation.
parameters: {"logit_softcap":15}
Novel Contributions
- STE fake-int6 QAT throughout training
- 3x MLP expansion to increase capacity under int6 constraints
- NorMuon optimizer with row-wise RMS normalization after Newton-Schulz orthogonalization
- SWA checkpoint averaging during warmdown
- Mixed quantization: per-row int6 on MLP and attention weights, fp16 passthrough for tied embeddings
- Sliding window evaluation with stride 64