PR #130

open

Non-record: Muon-Aware QAT + LAWA + Adaptive LR Scheduling (7 toggleable improvements)

val_bpb
1.6372
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Quantization
QAT
bits: 8
scope: large matrices (>65K params)
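A minimal sketch of what symmetric per-tensor int8 fake quantization restricted to large matrices could look like. The function name `fake_quantize_int8`, the flat-list representation, and the per-tensor symmetric scaling are illustrative assumptions; the PR's actual implementation is not shown here.

```python
def fake_quantize_int8(weights, min_numel=65_536):
    """Symmetric per-tensor int8 fake quantization (illustrative sketch).

    Tensors at or below `min_numel` elements are returned unchanged,
    matching the ">65K params" scope. `weights` is a flat list of floats
    standing in for a weight tensor.
    """
    if len(weights) <= min_numel:
        return list(weights)
    qmax = 127  # int8 symmetric range: [-127, 127]
    scale = (max(abs(w) for w in weights) or 1e-8) / qmax
    # Round to the int8 grid, clamp, then map back to float ("fake" quant).
    return [min(qmax, max(-qmax, round(w / scale))) * scale for w in weights]
```

In a real QAT loop this would run in the forward pass only, with gradients flowing to the unquantized master weights.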
Weight Averaging
LAWA
parameters: {"start_frac":0.8,"checkpoint_interval_steps":200,"avg_last_fraction":0.2}
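The LAWA parameters above can be read as: collect a checkpoint every 200 steps once 80% of training has elapsed, then average the collected weights. A hedged sketch under that reading (the helper names `lawa_checkpoint_steps` and `lawa_average`, and the dict-of-lists checkpoint format, are assumptions for illustration):

```python
def lawa_checkpoint_steps(total_steps, start_frac=0.8, interval=200):
    """Steps at which checkpoints are collected for LAWA averaging."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps + 1) if s % interval == 0]

def lawa_average(checkpoints):
    """Uniformly average checkpoints (each a dict: name -> list of floats)."""
    n = len(checkpoints)
    return {
        name: [sum(ck[name][i] for ck in checkpoints) / n
               for i in range(len(values))]
        for name, values in checkpoints[0].items()
    }
```

The averaged weights replace the final checkpoint at evaluation time; training itself continues from the unaveraged weights.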
LR Schedule
warmdown schedule with an LR floor and a configurable cooldown fraction
parameters: {"lr_floor_fraction":0.1,"cooldown_fraction":0.6,"qat_lr_reduction":0.5,"qat_start_frac":0.75}
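One plausible reading of these parameters, sketched as a multiplier on the base LR: full LR until the final `cooldown_fraction` of training, then a linear decay down to `lr_floor_fraction`, with an extra multiplicative cut once QAT activates. The function `lr_multiplier` and this exact interpretation are assumptions, not the PR's verbatim schedule code.

```python
def lr_multiplier(step, total_steps,
                  lr_floor_fraction=0.1, cooldown_fraction=0.6,
                  qat_start_frac=0.75, qat_lr_reduction=0.5):
    """Warmdown LR schedule with a floor and a QAT-start reduction (sketch)."""
    frac = step / total_steps
    cooldown_start = 1.0 - cooldown_fraction
    if frac <= cooldown_start:
        mult = 1.0  # hold full LR before the cooldown window
    else:
        # Linear decay from 1.0 down to the LR floor over the cooldown window.
        progress = (frac - cooldown_start) / cooldown_fraction
        mult = 1.0 - progress * (1.0 - lr_floor_fraction)
    if frac >= qat_start_frac:
        mult *= qat_lr_reduction  # halve LR once QAT switches on
    return mult
```

For example, at 70% of training the multiplier is 0.55 (halfway through the cooldown), and at the final step it is 0.1 × 0.5 = 0.05.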
Sequence Length
sequence_length
train_length: 256
eval_length: null
Compression
zstd
level: 22
Other
other
Muon-aware QAT with two modes (STE and Gaussian noise), activated late in training to preserve Muon's momentum subspace
parameters: {"modes":["STE","Gaussian noise"],"qat_start_frac":0.75,"lr_reduction_on_qat_start":0.5}
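The two modes can be contrasted in a small sketch: STE hard-rounds weights to the quantization grid (with gradients passed through the rounding unchanged in the backward pass), while the Gaussian-noise mode replaces the non-smooth rounding with additive noise. The function `qat_perturb`, and matching the noise std to the quantization step divided by √12 (the std of uniform rounding error), are illustrative assumptions.

```python
import random

def qat_perturb(weights, mode="STE", bits=8, seed=None):
    """Apply one of two QAT weight perturbations to a flat float list (sketch)."""
    rng = random.Random(seed)
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) or 1e-8) / qmax
    if mode == "STE":
        # Hard round-to-grid; a straight-through estimator would treat
        # this rounding as the identity when backpropagating.
        return [min(qmax, max(-qmax, round(w / scale))) * scale
                for w in weights]
    # Gaussian-noise mode: a smooth surrogate for rounding error, which
    # avoids feeding a non-smooth perturbation into Muon's
    # orthogonalized updates.
    step_std = scale / 12 ** 0.5
    return [w + rng.gauss(0.0, step_std) for w in weights]
```

Either perturbation would be switched on only after `qat_start_frac` of training, together with the 0.5× LR reduction listed above.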
other
Higher learning rates for matrix, scalar, and tied embedding parameters
parameters: {"matrix_lr":0.06,"scalar_lr":0.06,"tied_embed_lr":0.08}
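These per-group rates would typically be realized as separate optimizer parameter groups. A sketch of the grouping logic, with the classification rule (embeddings matched by name, ≥2-D tensors as matrices, the rest as scalars) and the helper `build_param_groups` assumed for illustration:

```python
def build_param_groups(named_params, matrix_lr=0.06, scalar_lr=0.06,
                       tied_embed_lr=0.08):
    """Split parameters into LR groups (sketch; classification rule assumed).

    `named_params` is a list of (name, ndim) pairs; real code would inspect
    the tensors themselves rather than name substrings.
    """
    groups = {
        "matrix": {"lr": matrix_lr, "params": []},
        "scalar": {"lr": scalar_lr, "params": []},
        "tied_embed": {"lr": tied_embed_lr, "params": []},
    }
    for name, ndim in named_params:
        if "embed" in name:
            groups["tied_embed"]["params"].append(name)
        elif ndim >= 2:
            groups["matrix"]["params"].append(name)
        else:
            groups["scalar"]["params"].append(name)
    return groups
```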

Novel Contributions

  • Muon-aware QAT designed to reduce quantization noise amplification in Muon's orthogonalized updates
  • Two QAT modes: standard STE and a Gaussian-noise mode
  • Late-start QAT activation with automatic learning-rate reduction
  • LAWA (Latest Weight Averaging) over late-stage checkpoints
  • Learning-rate floor to avoid freezing into sharp minima
  • Cooldown-fraction-based LR scheduling
  • Sequence length warmup from 256 to 1024 tokens
  • Adaptive artifact compression using zstd or Brotli
  • Higher default learning rates for matrix, scalar, and tied embedding parameters
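The sequence-length warmup above can be sketched as a schedule function. The PR only states the 256 → 1024 endpoints; the linear ramp over the first half of training, the rounding to multiples of 64, and the name `seq_len_at` are assumptions for illustration.

```python
def seq_len_at(step, total_steps, start_len=256, final_len=1024,
               warmup_frac=0.5):
    """Linear sequence-length warmup from start_len to final_len (sketch).

    Ramps linearly over the first `warmup_frac` of training, then holds
    at `final_len`. Lengths are rounded to multiples of 64 to keep
    batch shapes kernel-friendly.
    """
    frac = min(step / (total_steps * warmup_frac), 1.0)
    length = start_len + frac * (final_len - start_len)
    return int(round(length / 64) * 64)
```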