PR #667

open

Non-record: Fixed Bank QAT + XSA5 + Label Smoothing (1.1352)

by suchitj2702
val_bpb
1.1352
Architecture
GPT
Optimizer
Parallel Muon
Artifact Size
15.44 MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: all bank parameters
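The quantization step above can be sketched in plain Python. This is a minimal illustration of symmetric int6 fake-quantization (the forward pass of STE QAT); the function name and per-tensor scaling are assumptions, not the PR's actual code.

```python
def fake_quant_int6(weights, bits=6):
    """Symmetric fake-quantization: snap each weight to an int6 grid and
    dequantize back to float. In STE QAT the backward pass treats the
    rounding as identity, so gradients flow straight through it."""
    qmax = 2 ** (bits - 1) - 1  # 31 positive levels for int6
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q]
```

In QAT this runs inside the forward pass every step, so the loss is computed against quantized weights while the optimizer still updates the full-precision copies.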
Architecture
XSA
Expanded XSA from the last 4 layers to the last 5 layers.
parameters: {"layers":5}
Regularization
label smoothing
parameters: {"value":0.05}
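As a hedged sketch, label smoothing with value 0.05 mixes the one-hot target with a uniform distribution before taking cross-entropy. The function below is illustrative (it follows the common convention of spreading smoothing/K over all K classes), not the PR's code.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.05):
    """Cross-entropy against a smoothed target: (1 - smoothing) on the
    true class plus smoothing/K spread uniformly over all K classes."""
    K = len(logits)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p = [z - log_z for z in logits]
    q = [(1.0 - smoothing) * (1.0 if i == target else 0.0) + smoothing / K
         for i in range(K)]
    return -sum(qi * lp for qi, lp in zip(q, log_p))
```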
Test-Time Training
full TTT
parameters: {"learning_rate":0.003,"momentum":0.95,"epochs":3,"chunk_tokens":32768}
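The TTT hyperparameters above (learning rate 0.003, momentum 0.95, 3 epochs over 32768-token chunks) suggest a plain momentum-SGD inner loop run at evaluation time. A minimal sketch of one such update, with illustrative names:

```python
def ttt_step(weights, grads, velocity, lr=0.003, momentum=0.95):
    """One momentum-SGD update of the kind a test-time-training inner
    loop would apply while adapting to an evaluation chunk."""
    velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity
```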
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
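The other_params imply Muon's momentum ramps from 0.92 to the final 0.99 over the first 1500 steps. A linear ramp (the shape of the schedule is an assumption) would look like:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end`, then hold."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```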
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
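A minimal sketch of the two averaging schemes listed (an EMA with decay 0.997 applied every step, and an SWA running mean over snapshots taken every 50 steps); function names are illustrative:

```python
def ema_update(ema, weights, decay=0.997):
    """EMA: ema <- decay * ema + (1 - decay) * weights, every step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

def swa_update(swa, n_snapshots, weights):
    """SWA: running mean over periodic weight snapshots."""
    swa = [(s * n_snapshots + w) / (n_snapshots + 1)
           for s, w in zip(swa, weights)]
    return swa, n_snapshots + 1
```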
Evaluation
sliding window eval
parameters: {"stride":64}
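Sliding-window evaluation with stride 64 presumably advances the context window 64 tokens at a time and scores only each window's newest tokens, so every token is predicted with near-full left context. A sketch of the span bookkeeping under that assumption:

```python
def sliding_window_spans(n_tokens, window, stride=64):
    """Spans for sliding-window eval: each window advances by `stride`
    and only the tokens from `scored_upto` to `end` are newly scored."""
    spans = []
    scored_upto = 0
    while scored_upto < n_tokens:
        end = min(scored_upto + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
    return spans
```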
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
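A warmdown schedule with warmdown_steps 3500 typically holds the learning rate flat and then decays it linearly to zero over the final 3500 steps; a sketch under that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```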
Other
other
The Bank QAT fix is implemented directly in GPT.forward() as STE int6 fake-quantization of the bank parameters; enabling it requires a torch.compile reset and recompile.
parameters: {"recompile_cost_seconds":50,"overhead_ms_per_step":5}

Novel Contributions

  • Fixed broken Bank QAT by implementing STE int6 fake-quantization directly in GPT.forward() for bank parameters.
  • Expanded XSA from 4 layers to 5 layers.
  • Added label smoothing of 0.05.
  • Tuned TTT hyperparameters to learning rate 0.003 and momentum 0.95.
  • Reported that the QAT fix was expensive: the ~50 s recompilation cost plus ~5 ms of per-step overhead reduced the number of training steps that fit in the budget.