PR #712 (closed)
Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v3]
by Shuvam-Banerji-Seal
val_bpb
1.4078
Architecture
modded-nanogpt-derived Transformer
Optimizer
—
Artifact Size
15.77 MB
Training Techniques
Quantization
QAT
bits: 6
scope: all
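The 6-bit setting above corresponds to quantization-aware training (QAT), where weights are rounded to a low-bit grid on the forward pass but kept in full precision for updates. The sketch below is a minimal, framework-free illustration of symmetric 6-bit fake quantization; only `bits=6` comes from the PR, and the function name and absmax scaling rule are assumptions for illustration (the actual run uses PyTorch tensors).

```python
def fake_quantize(weights, bits=6):
    # Symmetric fake quantization: map each weight onto a signed
    # (2**bits)-level grid via round(w / scale) * scale, so the
    # model trains against the rounding error it will see at inference.
    # NOTE: illustrative sketch, not the PR's actual PyTorch code.
    qmax = 2 ** (bits - 1) - 1            # 31 representable magnitudes for 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid divide-by-zero
    return [round(w / scale) * scale for w in weights]
```

Each output value differs from its input by at most half a quantization step (`scale / 2`), which is what makes training through the rounding tractable.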
Weight Averaging
SWA
parameters: null
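Since the SWA parameters are unrecorded (`null`), here is only the generic mechanism: stochastic weight averaging keeps an equal-weight running mean of model weights sampled along the training trajectory. The incremental-update form below is a minimal sketch (PyTorch provides this as `torch.optim.swa_utils.AveragedModel`); the function name is hypothetical.

```python
def swa_update(avg, new, n):
    # Incremental equal-weight average: `avg` already averages `n`
    # weight snapshots; fold in snapshot `new` without storing history.
    # avg_{n+1} = avg_n + (new - avg_n) / (n + 1)
    return [a + (w - a) / (n + 1) for a, w in zip(avg, new)]
```

Averaged weights typically evaluate better than the final iterate because the mean lands nearer the center of the loss basin.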
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: null
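The eval parameters are likewise unrecorded, so the sketch below shows only the general span bookkeeping behind sliding-window evaluation: the sequence is scored in strides, each stride getting up to a full window of left context, and every token counted exactly once. The function name and the window/stride values in the test are assumptions, not values from the PR.

```python
def sliding_eval_spans(seq_len, window, stride):
    # Produce (ctx_start, end, score_start) triples: tokens in
    # [score_start, end) are scored using context from ctx_start,
    # so each token sees up to `window` tokens of left context and
    # no token is scored twice across spans.
    spans = []
    score_start = 0
    while score_start < seq_len:
        end = min(score_start + stride, seq_len)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, score_start))
        score_start = end
    return spans
```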
LR Schedule
LR scheduling tuned for a single-device run
parameters: {"gradient_accum_tokens":131000,"iterations":2600}
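The `gradient_accum_tokens: 131000` figure determines how many micro-batches are summed before each optimizer step. A minimal sketch of that sizing arithmetic, with hypothetical batch-size and sequence-length values (only the 131K token target comes from the PR):

```python
def accum_steps(target_tokens, micro_batch_size, seq_len):
    # Gradients from this many micro-batches are accumulated before
    # one optimizer step, so each step sees ~target_tokens tokens.
    tokens_per_micro = micro_batch_size * seq_len
    return max(1, round(target_tokens / tokens_per_micro))
```

With 2600 optimizer iterations at ~131K tokens each, the run covers roughly 340M training tokens on the single A100.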
Other
other
Replaced torch.quantile with w.amax().clamp_min / w.abs().amax(dim=1) to avoid a Triton compiler performance penalty
parameters: null
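The motivation for the swap above is that `torch.quantile` needs a sort-like reduction that compiles poorly under Triton, while an absolute-max reduction is a single cheap pass. A minimal stdlib sketch of the amax-style per-row scale (the PR's actual ops are `w.abs().amax(dim=1)` and `amax().clamp_min` on PyTorch tensors; the function name and epsilon here are assumptions):

```python
def absmax_scale_per_row(w, eps=1e-8):
    # Per-row absolute maximum, clamped away from zero -- a cheap
    # clip threshold standing in for a quantile-based one.
    # Mirrors w.abs().amax(dim=1).clamp_min(eps) in spirit.
    return [max(eps, max(abs(x) for x in row)) for row in w]
```

Unlike a quantile threshold, absmax never clips any weight, so the scale is slightly looser but far cheaper to compute every step.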
other
Fixed bigram embeddings validation when size < 2
parameters: null
Novel Contributions
- Tuned hyperparameters down from multi-device scales for a single A100 run
- Replaced torch.quantile with amax-based clipping to avoid a severe Triton compiler performance penalty
- Adjusted gradient accumulation sizing to 131K tokens so the run completes the intended training iterations
- Added validation handling for bigram embeddings when size < 2
- Cleaned up unused dependencies/imports and corrected compressor variable logging