PR #719 (closed)

Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v4]

by Shuvam-Banerji-SealView on GitHub
val_bpb: 1.5252
Architecture: modded-nanogpt-derived Transformer
Optimizer: (not specified)
Artifact Size: 15.77 MB

Training Techniques

Quantization: QAT (bits: 6, scope: all)
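The submission does not include the QAT code itself; below is a minimal sketch of symmetric 6-bit fake quantization with a straight-through estimator, which is one common way "QAT, bits: 6, scope: all" is implemented. The function name, per-row scaling scheme, and the 1e-8 floor are assumptions, not the PR's actual code.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Round weights to a symmetric 2^bits-level grid, keeping float dtype.

    Straight-through estimator: forward uses the quantized weights,
    backward treats rounding as the identity.
    """
    qmax = 2 ** (bits - 1) - 1                                   # 31 for 6 bits
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax)                   # integer grid
    wq = q * scale                                               # dequantize
    return w + (wq - w).detach()                                 # straight-through

torch.manual_seed(0)
w = torch.randn(4, 16)
wq = fake_quantize(w, bits=6)
```

At most 2^6 = 64 distinct levels exist per row, so the quantized weights compress far better than raw floats, which is consistent with the 15.77 MB artifact.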
Evaluation: sliding window eval
Weight Averaging: SWA
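No SWA parameters are reported; a minimal sketch of what weight averaging over late-training checkpoints typically looks like follows. The running equal-weight average and the class name are assumptions.

```python
import torch

class WeightAverager:
    """Running equal-weight average of model parameters (SWA-style)."""

    def __init__(self):
        self.n = 0
        self.avg = {}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        """Fold the model's current weights into the running average."""
        self.n += 1
        for name, p in model.named_parameters():
            if name not in self.avg:
                self.avg[name] = p.detach().clone()
            else:
                self.avg[name] += (p.detach() - self.avg[name]) / self.n

    @torch.no_grad()
    def copy_to(self, model: torch.nn.Module):
        """Overwrite the model's weights with the averaged weights."""
        for name, p in model.named_parameters():
            p.copy_(self.avg[name])
```

In practice `update` would be called every few steps near the end of training, and `copy_to` once before export and evaluation.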
Compression: zstd (level not specified)
Sequence Length: train_length 131000, eval_length not specified
LR Schedule
standard LR scheduling tuned for single-device run
parameters: {"scaled_down_from_multi_device":true}
Other: replaced torch.quantile with w.abs().amax(dim=1).clamp_min to avoid a Triton compilation slowdown
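The two scale computations can be sketched side by side. Assuming the original quantile was effectively the per-row maximum (q = 1.0; the exact quantile used upstream is not stated in the PR), the plain `amax` reduction is numerically identical while compiling to a much simpler kernel under torch.compile.

```python
import torch

def scales_quantile(w: torch.Tensor) -> torch.Tensor:
    # Original form: per-row quantile, which torch.compile lowers to a
    # comparatively slow-to-compile Triton sort-based kernel.
    return torch.quantile(w.abs(), 1.0, dim=1).clamp_min(1e-8)

def scales_amax(w: torch.Tensor) -> torch.Tensor:
    # Replacement: a plain max reduction, cheap to compile and identical
    # to the q = 1.0 quantile.
    return w.abs().amax(dim=1).clamp_min(1e-8)

w = torch.randn(8, 64)
assert torch.allclose(scales_quantile(w), scales_amax(w))
```

The `clamp_min(1e-8)` floor (value assumed here) keeps a later division by the scale safe for all-zero rows.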
Other: made compressor-dependent labels and final-roundtrip labels explicit in training logs
Architecture: bigram embedding guard (handles small-vocab edge cases in the bigram embedding path)
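The PR summary does not show the guard itself. A hypothetical sketch of one such guard follows: if bigram pair ids are formed as `prev * vocab_size + current`, a small vocabulary (or padded ids) can index past the pair table, and a clamp prevents that. Every name below is invented for illustration.

```python
import torch

def bigram_embed(idx: torch.Tensor, table: torch.nn.Embedding,
                 vocab_size: int) -> torch.Tensor:
    """Embed (previous, current) token pairs as pair id = prev * V + current.

    Hypothetical reconstruction; the clamp is the small-vocab/padding guard.
    """
    zeros = idx.new_zeros(idx.shape[:-1] + (1,))
    prev = torch.cat([zeros, idx[..., :-1]], dim=-1)   # shift right, pad with 0
    pair = prev * vocab_size + idx                     # flat pair index
    pair = pair.clamp(0, table.num_embeddings - 1)     # guard against overflow
    return table(pair)
```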

Novel Contributions

  • Tuned hyperparameters down from multi-device scales for a single A100 run to preserve proper LR scheduling
  • Replaced torch.quantile with w.abs().amax(dim=1).clamp_min to avoid a large Triton compilation slowdown
  • Added a guard for small-vocab edge cases in the bigram embedding path
  • Made compressor-dependent labels and final-roundtrip labels explicit in training logs
  • Used final post-export sliding-window roundtrip metric as the reported submission val_bpb
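For reference, bits-per-byte converts total cross-entropy from nats to bits and normalizes by the byte length of the evaluated text. The sketch below shows only that conversion; how the harness accounts tokens against bytes in the sliding-window roundtrip is an assumption not confirmed by the PR.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Total cross-entropy in nats over the eval text, divided by
    its byte length, converted from nats to bits via log(2)."""
    return total_nats / (total_bytes * math.log(2))

# Example: 1.0572 nats over 1 byte -> roughly 1.5252 bpb.
```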