PR #130
Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling (7 toggleable improvements)
by mohosy
val_bpb: 1.6372
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Quantization: QAT (bits: 8; scope: large matrices, >65K params)
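As a rough illustration of the QAT setup described above (8-bit fake quantization applied only to large weight matrices), here is a minimal NumPy sketch. The symmetric per-tensor scaling, the function names, and the size threshold are assumptions, not the PR's exact code:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Symmetric per-tensor fake quantization: quantize then dequantize.

    In QAT the backward pass treats this op as the identity (the
    straight-through estimator), so gradients flow to the full-precision
    weights while the forward pass sees quantized values.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    scale = np.abs(w).max() / qmax        # per-tensor scale (assumed)
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def maybe_quantize(w, threshold=65_000):
    """Apply fake quantization only to large matrices (>65K params)."""
    return fake_quantize(w) if w.size > threshold else w
```

The quantize-dequantize roundtrip keeps each weight within one quantization step of its full-precision value, which is the error the training loop learns to tolerate.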
Weight Averaging: LAWA (start_frac: 0.8, checkpoint_interval_steps: 200, avg_last_fraction: 0.2)
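A minimal sketch of the LAWA scheme as parameterized above: snapshots are taken every 200 steps once training passes 80% of the run, and the uniform average of those late checkpoints is used for evaluation. Treating avg_last_fraction: 0.2 as the same tail implied by start_frac: 0.8 is an assumption, as are all names:

```python
import numpy as np

class LAWA:
    """Latest Weight Averaging over late-stage checkpoints (sketch)."""

    def __init__(self, total_steps, start_frac=0.8, interval=200):
        self.start_step = int(total_steps * start_frac)
        self.interval = interval
        self.snapshots = []

    def maybe_snapshot(self, step, weights):
        # Collect checkpoints only in the last (1 - start_frac) of training.
        if step >= self.start_step and step % self.interval == 0:
            self.snapshots.append(np.copy(weights))

    def averaged_weights(self):
        # Uniform average of all retained late-stage snapshots.
        return np.mean(self.snapshots, axis=0)
```

At eval time the averaged weights would stand in for the final raw checkpoint.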
LR Schedule: warmdown with LR floor and cooldown-fraction schedule (lr_floor_fraction: 0.1, cooldown_fraction: 0.6, qat_lr_reduction: 0.5, qat_start_frac: 0.75)
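One plausible reading of these parameters (not necessarily the PR's exact schedule): hold the peak LR for the first 40% of training, linearly decay over the final cooldown_fraction: 0.6 of steps down to lr_floor_fraction: 0.1 of peak, and halve the LR (qat_lr_reduction: 0.5) once QAT activates at qat_start_frac: 0.75:

```python
def lr_at(step, total_steps, peak_lr,
          lr_floor_fraction=0.1, cooldown_fraction=0.6,
          qat_start_frac=0.75, qat_lr_reduction=0.5):
    """Warmdown schedule with an LR floor and a QAT-time LR cut (sketch)."""
    frac = step / total_steps
    cooldown_start = 1.0 - cooldown_fraction
    if frac <= cooldown_start:
        lr = peak_lr
    else:
        # Linear decay from peak down to the floor over the cooldown window.
        progress = (frac - cooldown_start) / cooldown_fraction
        floor = peak_lr * lr_floor_fraction
        lr = peak_lr + progress * (floor - peak_lr)
    if frac >= qat_start_frac:
        lr *= qat_lr_reduction  # reduce LR when QAT switches on
    return lr
```

The floor keeps late-stage steps large enough that the model is not frozen into a sharp minimum, which matches the rationale listed under Novel Contributions.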
Sequence Length: train_length: 256, eval_length: null
Compression: zstd (level: 22)
Other: Muon-aware QAT with two modes, STE and Gaussian noise, activated late to preserve Muon's momentum subspace (qat_start_frac: 0.75, lr_reduction_on_qat_start: 0.5)
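The Gaussian-noise mode can be sketched as replacing hard rounding with additive noise matched to the quantization step, which keeps the forward pass smooth and avoids injecting a hard nonlinearity into the updates that Muon orthogonalizes. Matching the noise std to the uniform rounding-error variance (step²/12) is an assumption:

```python
import numpy as np

def qat_noise_forward(w, bits=8, rng=None):
    """Gaussian-noise QAT mode (sketch): instead of round-to-nearest,
    perturb weights with noise whose std matches the rounding error of
    the target quantizer, so training 'feels' quantization smoothly."""
    rng = rng or np.random.default_rng(0)
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(w).max() / qmax        # quantization step (assumed per-tensor)
    sigma = step / np.sqrt(12.0)         # std of uniform rounding error
    return w + rng.normal(0.0, sigma, size=w.shape)
```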
Other: Higher learning rates for matrix, scalar, and tied embedding parameters (matrix_lr: 0.06, scalar_lr: 0.06, tied_embed_lr: 0.08)
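The per-group learning rates above map naturally onto optimizer parameter groups. A minimal sketch; the rule for classifying tensors as matrix, scalar, or tied-embedding (here by shape and name) is purely illustrative:

```python
def build_param_groups(named_shapes,
                       matrix_lr=0.06, scalar_lr=0.06, tied_embed_lr=0.08):
    """Assign per-group learning rates by parameter kind (sketch).

    `named_shapes` maps parameter name -> shape tuple. Tied embeddings
    are identified by name here, purely for illustration.
    """
    groups = {"matrix": {"lr": matrix_lr, "params": []},
              "scalar": {"lr": scalar_lr, "params": []},
              "tied_embed": {"lr": tied_embed_lr, "params": []}}
    for name, shape in named_shapes.items():
        if "embed" in name:
            groups["tied_embed"]["params"].append(name)
        elif len(shape) >= 2:
            groups["matrix"]["params"].append(name)
        else:
            groups["scalar"]["params"].append(name)
    return list(groups.values())
```

Each group would then be handed to the optimizer with its own lr, leaving the rest of the optimizer configuration shared.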
Novel Contributions
- Muon-aware QAT designed to reduce quantization noise amplification in Muon's orthogonalized updates
- Two QAT modes: a standard straight-through estimator (STE) and a Gaussian-noise mode
- Late-start QAT activation with automatic learning-rate reduction
- LAWA (Latest Weight Averaging) over late-stage checkpoints
- Learning-rate floor to avoid freezing into sharp minima
- Cooldown-fraction-based LR scheduling
- Sequence length warmup from 256 to 1024 tokens
- Adaptive artifact compression using zstd or Brotli
- Higher default learning rates for matrix, scalar, and tied embedding parameters
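The sequence-length warmup in the list above (256 to 1024 tokens) can be sketched as a simple step schedule. The PR only records the endpoints, so the doubling-at-evenly-spaced-milestones rule below is an assumption:

```python
def seq_len_at(step, total_steps, start_len=256, final_len=1024):
    """Sequence-length warmup (sketch): double the training context at
    evenly spaced milestones until final_len is reached."""
    # 1024 // 256 == 4, so two doublings: 256 -> 512 -> 1024.
    n_doublings = (final_len // start_len).bit_length() - 1
    length = start_len
    for i in range(1, n_doublings + 1):
        if step >= total_steps * i / (n_doublings + 1):
            length *= 2
    return min(length, final_len)
```

Short early contexts make the initial steps cheap; the full 1024-token context is only paid for in the later phase of training.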