PR #117 (open)
Submission: Int6 MLP3x + QAT + SlidingWindow (val_bpb: 1.1702)
by trovatochris
val_bpb: 1.1702
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,306,777 bytes
Training Techniques
Quantization
int6
bits: 6
scope: per-row weights
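A minimal sketch of symmetric per-row int6 quantization, assuming a max-abs scale per row and a symmetric [-31, 31] grid (the exact scheme in the PR may differ):

```python
import numpy as np

def quantize_per_row_int6(w: np.ndarray):
    """Symmetric per-row int6 quantization sketch.

    Each row gets its own scale, so a single outlier row does not
    inflate the quantization error of every other row.
    """
    qmax = 31  # 6-bit signed range, used symmetrically: [-31, 31]
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_row_int6(w)
w_hat = dequantize(q, s)
```

With a max-abs scale, rounding keeps the per-element error within half a quantization step of that row's scale.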
QAT
bits: null
scope: weights
Architecture
MLP3x
Expanded MLP width by 3x.
parameters: {"multiplier":3}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
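The PR gives the warmup endpoints (0.92 to 0.99) and the step count (1500); a linear ramp is an assumption about the schedule shape:

```python
def muon_momentum(step: int,
                  warmup_steps: int = 1500,
                  start: float = 0.92,
                  end: float = 0.99) -> float:
    """Linear momentum warmup sketch: ramp from `start` to `end`
    over `warmup_steps`, then hold at `end`."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)

# momentum at the start, midpoint, and end of the warmup
values = [muon_momentum(s) for s in (0, 750, 1500)]
```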
Compression
zstd
level: 22
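The submission compresses the quantized artifact with zstd at level 22. Since zstd needs a third-party binding in Python, this sketch uses stdlib `zlib` as a stand-in to show the same pack-then-compress pipeline; int6 rows stored as int8 leave ~2 bits of slack per value for the entropy coder to reclaim:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# stand-in for quantized weight rows: int6 values stored one per byte
q = rng.integers(-31, 32, size=(256, 256), dtype=np.int8)

raw = q.tobytes()
compressed = zlib.compress(raw, level=9)  # zstd --ultra -22 plays this role in the PR

# round trip: the artifact must decompress to exactly the same integers
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8).reshape(q.shape)
```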
Evaluation
sliding window eval
parameters: {"stride":64}
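Sliding-window evaluation slides a full-length context over the sequence in steps of `stride`, scoring only the final `stride` tokens of each window so every token is predicted with near-maximal left context. Only stride=64 comes from the PR; the window length and span bookkeeping here are illustrative:

```python
def sliding_window_spans(seq_len: int, window: int, stride: int):
    """Return (context_start, score_start, score_end) triples covering
    the sequence, scoring `stride` tokens per window."""
    spans = []
    for start in range(0, seq_len, stride):
        ctx_start = max(0, start + stride - window)
        spans.append((ctx_start, start, min(start + stride, seq_len)))
        if start + stride >= seq_len:
            break
    return spans

spans = sliding_window_spans(seq_len=200, window=128, stride=64)
```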
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Other
QAT weight-snapping started at 70% of training.
parameters: {"qat_start_frac":0.7}
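A sketch of the QAT start logic: for the last 30% of training, the forward pass runs on weights snapped to the int6 per-row grid, so the model adapts to the grid it will ship with. In a real QAT setup gradients pass straight through to the full-precision copy; this sketch only shows the snapping decision:

```python
import numpy as np

def fake_quant_row(w: np.ndarray, qmax: int = 31) -> np.ndarray:
    """Quantize-dequantize (snap) weights to a symmetric per-row int6 grid."""
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    return np.clip(np.round(w / scales), -qmax, qmax) * scales

def forward_weights(w: np.ndarray, step: int, total_steps: int,
                    qat_start_frac: float = 0.7) -> np.ndarray:
    """Use snapped weights once training passes qat_start_frac."""
    if step >= qat_start_frac * total_steps:
        return fake_quant_row(w)
    return w

w = np.random.randn(4, 8)
early = forward_weights(w, step=100, total_steps=1000)  # before 70%: untouched
late = forward_weights(w, step=800, total_steps=1000)   # after 70%: snapped
```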
Novel Contributions
- Int6 per-row quantization stacked with zstd level-22 compression
- 3x MLP expansion
- QAT weight-snapping starting at 70% of training
- Muon optimizer tuning with momentum warmup
- Extended warmdown schedule
- Stride-64 sliding window evaluation