PR #117 (open)
submission: Int6 MLP3x + QAT + SlidingWindow (val_bpb: 1.1702)

by trovatochris
val_bpb: 1.1702
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,306,777 bytes

Training Techniques

Quantization: int6 (bits: 6, scope: per-row weights)
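A minimal numpy sketch of per-row int6 weight quantization, assuming a symmetric scheme with one scale per weight row (the submission's exact scaling and rounding choices are not shown; int6 codes are held in an int8 container for illustration):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: one float scale per weight row.
    Hypothetical sketch; the submission's exact scheme is an assumption."""
    qmax = 31  # symmetric range [-31, 31] inside signed int6's [-32, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # per-row scale
    scale = np.where(scale == 0, 1.0, scale)             # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)  # reconstruction error is at most scale/2 per row
```

Per-row scales keep the quantization error proportional to each row's magnitude, which matters when rows (e.g. per-output-channel weights) have very different ranges.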
QAT (bits: null, scope: weights)
Architecture: MLP3x (parameters: {"multiplier":3})
Expanded MLP width by 3x.
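The MLP3x change amounts to a hidden width of 3 * d_model in each MLP block (versus the common 4x). A bias-free numpy sketch, with ReLU as a placeholder activation since the submission does not specify one:

```python
import numpy as np

def mlp3x_forward(x, w_in, w_out):
    """One bias-free MLP block with hidden width 3 * d_model (the MLP3x change).
    ReLU is a placeholder; the actual activation is not stated."""
    h = np.maximum(x @ w_in, 0.0)  # expand: d_model -> 3 * d_model
    return h @ w_out               # project back: 3 * d_model -> d_model

d_model = 8
rng = np.random.default_rng(0)
w_in = rng.normal(size=(d_model, 3 * d_model))   # 3x expansion
w_out = rng.normal(size=(3 * d_model, d_model))
y = mlp3x_forward(rng.normal(size=(2, d_model)), w_in, w_out)
```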
Optimizer: Muon (momentum: 0.99, weight_decay: null)
Momentum warmup parameters: {"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
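The momentum warmup can be read directly off the listed parameters: ramp Muon's momentum from 0.92 to 0.99 over the first 1500 steps. A sketch assuming a linear ramp (the schedule's exact shape is not stated):

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    """Linearly warm momentum from `start` to `end` over `warmup_steps`,
    then hold it at `end`. Linear shape is an assumption."""
    if step >= warmup_steps:
        return end
    frac = step / warmup_steps
    return start + frac * (end - start)
```

A lower early momentum damps the influence of noisy initial gradients before the run settles into the high-momentum regime.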
Compression: zstd (level: 22)
Evaluation: sliding window eval (parameters: {"stride":64})
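Stride-64 sliding window evaluation advances the context window 64 tokens at a time and scores only the newly added tokens, so nearly every token is scored with a long left context. A sketch of the window bookkeeping, assuming a 1024-token context length (the submission specifies only the stride):

```python
def sliding_window_spans(seq_len, window=1024, stride=64):
    """Return (ctx_start, score_from, score_to) spans for sliding-window eval:
    the model conditions on tokens [ctx_start, score_to) but only tokens
    [score_from, score_to) contribute to the loss. `window`=1024 is an
    assumption; the submission gives only stride=64."""
    spans = []
    scored = 0
    while scored < seq_len:
        score_to = min(scored + stride, seq_len)
        ctx_start = max(0, score_to - window)  # keep context within the window
        spans.append((ctx_start, scored, score_to))
        scored = score_to
    return spans
```

A smaller stride means more forward passes but gives each scored token closer-to-maximal context, which typically lowers measured bpb.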
LR Schedule: warmdown (parameters: {"warmdown_iters":3000})
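The warmdown schedule holds the learning rate flat and then decays it over the final 3000 iterations. A sketch assuming a linear decay to zero (only warmdown_iters is given; the decay shape and endpoint are assumptions):

```python
def warmdown_lr(step, total_steps, base_lr=1.0, warmdown_iters=3000):
    """Trapezoid-style schedule: hold base_lr, then decay linearly to 0 over
    the final `warmdown_iters` steps. Linear-to-zero is an assumption."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters
    return base_lr * (1.0 - frac)
```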
Other: QAT weight-snapping started at 70% of training (parameters: {"qat_start_frac":0.7})
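Weight snapping can be implemented by projecting weights onto the int6 per-row grid during the last 30% of training, so the optimizer adapts to the values that will survive quantization. A sketch reusing a symmetric per-row scheme (the grid and the straight-through details are assumptions; the submission states only the 70% start fraction):

```python
import numpy as np

def qat_snap(w, step, total_steps, qat_start_frac=0.7, qmax=31):
    """Before qat_start_frac of training: return weights unchanged.
    After: snap each weight to its per-row symmetric int6 grid point.
    The grid choice mirrors a per-row int6 scheme; details are assumptions."""
    if step < qat_start_frac * total_steps:
        return w  # first 70% of training: ordinary float weights
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.random.default_rng(1).normal(size=(3, 5))
w_early = qat_snap(w, step=100, total_steps=1000)  # unchanged
w_late = qat_snap(w, step=800, total_steps=1000)   # snapped to the grid
```

Snapping is idempotent: re-snapping already-snapped weights leaves them (numerically) unchanged, so by the end of training the final int6 export is nearly lossless.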

Novel Contributions

  • Int6 per-row quantization stacked with zstd level-22 compression
  • 3x MLP expansion
  • QAT weight-snapping starting at 70% of training
  • Muon optimizer tuning with momentum warmup
  • Extended warmdown schedule
  • Stride-64 sliding window evaluation