val_bpb: 1.1352
Architecture: GPT
Optimizer: Parallel Muon
Artifact Size: 15.44 MB
Training Techniques
- Quantization: STE QAT (bits: 6, scope: all bank parameters)
- Architecture: XSA, expanded from the last 4 layers to the last 5 layers (parameters: {"layers":5})
- Regularization: label smoothing (parameters: {"value":0.05})
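Label smoothing at 0.05 mixes the one-hot target with a uniform distribution over the classes before taking the cross-entropy. A minimal pure-Python sketch of the smoothed loss (in PyTorch this is simply `F.cross_entropy(..., label_smoothing=0.05)`):

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.05):
    # log-softmax via the log-sum-exp trick for numerical stability
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - lse for l in logits]
    n = len(logits)
    # mix the one-hot target with a uniform distribution over all classes
    return -sum(((1 - eps) * (1.0 if i == target else 0.0) + eps / n) * lp
                for i, lp in enumerate(log_probs))
```

With eps=0 this reduces to the standard cross-entropy; smoothing always increases the loss on confident, correct predictions, which is the regularizing effect.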
- Test-Time Training: full TTT (parameters: {"learning_rate":0.003,"momentum":0.95,"epochs":3,"chunk_tokens":32768})
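Full TTT adapts the model on each evaluation chunk before scoring it. A hypothetical sketch of the two pieces implied by these hyperparameters, the 32768-token chunking and a heavy-ball momentum-SGD inner update at lr 0.003 / momentum 0.95; function names and the scalar-parameter form are illustrative, not from the actual code:

```python
def split_chunks(tokens, chunk_tokens=32768):
    # fixed-size chunks; the model is adapted on each chunk for several
    # epochs before that chunk is scored
    return [tokens[i:i + chunk_tokens] for i in range(0, len(tokens), chunk_tokens)]

def sgd_momentum_step(w, grad, vel, lr=0.003, momentum=0.95):
    # classic heavy-ball update used for the inner TTT loop
    vel = momentum * vel + grad
    return w - lr * vel, vel
```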
- Optimizer: Parallel Muon (weight_decay: 0.04, momentum: 0.99, other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500})
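The other_params suggest Muon's momentum is warmed up from 0.92 to its final 0.99 over the first 1500 steps. A sketch of that schedule; the linear interpolation shape is an assumption, since the log only records the start value and warmup length:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    # linearly interpolate momentum during warmup, then hold the final value
    t = min(1.0, step / warmup_steps)
    return start + t * (final - start)
```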
- Weight Averaging: EMA (parameters: {"decay":0.997}) and SWA (parameters: {"every":50})
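Both averaging schemes maintain a shadow copy of the weights: EMA decays toward the live weights every step with decay 0.997, while SWA keeps an equal-weight running average of snapshots taken every 50 steps. Minimal sketches over flat parameter lists (the real versions would operate on model state dicts):

```python
def ema_update(shadow, params, decay=0.997):
    # exponential moving average: shadow <- decay*shadow + (1-decay)*params
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]

def swa_update(avg, params, n_snapshots):
    # equal-weight running average over the snapshots collected so far
    return [(a * n_snapshots + p) / (n_snapshots + 1) for a, p in zip(avg, params)]
```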
- Evaluation: sliding window eval (parameters: {"stride":64})
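Sliding-window eval with stride 64 re-scores the sequence in overlapping windows, counting only the final 64 tokens of each window toward the loss so that every scored token sees near-full left context. A sketch of the window bookkeeping; the window size of 1024 is illustrative, not from the log:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    # (start, end, score_from): tokens in [score_from, end) count toward the
    # loss; earlier tokens in the window serve as context only
    out = [(0, min(window, n_tokens), 0)]
    end = out[0][1]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        out.append((new_end - window, new_end, end))
        end = new_end
    return out
```

The scored spans tile the token stream exactly once, so the per-token loss is comparable to a single full-context pass, at the cost of re-running the model once per stride.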
- LR Schedule: warmdown (parameters: {"warmdown_steps":3500})
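Warmdown here means holding the base learning rate flat and then decaying it over the final 3500 steps. A sketch, assuming the usual linear decay to zero (the decay shape is not recorded in the log):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # constant LR, then linear decay to zero over the final warmdown_steps
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```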
- Other: Bank QAT fix implemented directly in GPT.forward() using STE int6 fake-quantization for the bank parameters, with a torch.compile reset/recompile (parameters: {"recompile_cost_seconds":50,"overhead_ms_per_step":5})
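The STE fake-quantization rounds each bank parameter onto a symmetric int6 grid in the forward pass while the backward pass treats the rounding as the identity, so gradients flow through unchanged. A pure-Python sketch of the forward math (the real version runs on tensors inside GPT.forward(); the per-tensor absmax scaling is an assumption):

```python
def absmax_scale(values, bits=6):
    # symmetric per-tensor scale: the largest magnitude maps to 2**(bits-1)-1
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

def fake_quant_int6(x, scale):
    # round to the nearest int6 level and clamp to the signed 6-bit range
    q = max(-32, min(31, round(x / scale)))
    # dequantize; with an STE, backward would use dL/dx = dL/dy (identity)
    return q * scale
```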
Novel Contributions
- Fixed broken Bank QAT by implementing STE int6 fake-quantization directly in GPT.forward() for bank parameters.
- Expanded XSA from 4 layers to 5 layers.
- Added label smoothing of 0.05.
- Tuned TTT hyperparameters to learning rate 0.003 and momentum 0.95.
- Found the QAT fix too expensive in practice: the torch.compile recompilation (about 50 s) plus roughly 5 ms of overhead per step reduced the number of training steps achievable in the run.