val_bpb: 1.4078
Architecture: Transformer
Optimizer: —
Artifact Size: 15.77 MB
Training Techniques
- Quantization: QAT (bits: 6)
- Weight Averaging: SWA
- Evaluation: sliding-window eval
- LR Schedule: custom tuning from multi-device to single-device scale
- Other: replaced torch.quantile with w.abs().amax(dim=1).clamp_min to avoid a 30x compiler performance penalty in Triton
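The quantile-to-absmax swap can be sketched as follows. This is a minimal illustration, not the submission's actual CastedLinear code: it assumes a per-output-row symmetric scale, and the function name `absmax_scale` is hypothetical.

```python
import torch

def absmax_scale(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Per-row scale via abs-max: a single max reduction, which Triton
    # compiles efficiently, unlike torch.quantile (the reported ~30x
    # slowdown). clamp_min guards against division by zero for all-zero rows.
    return w.abs().amax(dim=1, keepdim=True).clamp_min(eps)

w = torch.randn(4, 16)
scale = absmax_scale(w)   # shape (4, 1), one scale per output row
w_normed = w / scale      # normalized values lie in [-1, 1]
```

Note that `amax` over the absolute values is equivalent to `torch.quantile(w.abs(), 1.0, dim=1)`, so for a full (q=1.0) quantile the replacement is exact rather than an approximation.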
Novel Contributions
- Tuned hyperparameters from multi-device to single-device (1x A100) scale to ensure a proper learning rate schedule and fit within the 10-minute training constraint
- Replaced torch.quantile with w.abs().amax(dim=1).clamp_min in CastedLinear to bypass a severe 30x GPU performance penalty caused by the Triton compiler
- Constrained the gradient-accumulation batch to 131K tokens, allowing 2600 iterations over which the LR can fully decay
- Used int6 quantization-aware training (QAT) to reduce artifact size while maintaining performance
- Graceful termination of training into SWA (stochastic weight averaging), followed by a final sliding-window evaluation
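The int6 QAT contribution can be sketched as symmetric fake quantization with a straight-through estimator. This is an illustrative reconstruction under stated assumptions (per-row absmax scaling, symmetric levels); the function name `fake_quant6` and the exact scaling scheme are not from the submission.

```python
import torch

def fake_quant6(w: torch.Tensor) -> torch.Tensor:
    # Symmetric int6 fake quantization: signed 6-bit range is [-31, 31].
    qmax = 2 ** (6 - 1) - 1                              # 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax)           # quantize to ints
    wq = q * scale                                       # dequantize
    # Straight-through estimator: forward pass uses the quantized weights,
    # backward pass treats the rounding as identity so gradients still flow.
    return w + (wq - w).detach()

w = torch.randn(64, 128)
wq = fake_quant6(w)  # same shape; each row uses at most 63 distinct levels
```

Training through this fake-quantized forward pass is what lets the final artifact store 6-bit weights (hence the small 15.77 MB size) without a post-training accuracy cliff.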
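The SWA tail can be sketched with PyTorch's built-in `torch.optim.swa_utils.AveragedModel`. This is a hedged sketch, not the submission's code: the toy `nn.Linear` stands in for the actual transformer, and the in-place weight bump stands in for real optimizer steps.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel

# Toy model standing in for the transformer (illustrative only).
model = nn.Linear(8, 8)
swa_model = AveragedModel(model)

# During the final stretch of training (e.g. as the 10-minute budget runs
# out), fold each step's weights into the running SWA average.
for step in range(5):
    with torch.no_grad():
        model.weight.add_(0.01)      # stand-in for an optimizer step
    swa_model.update_parameters(model)

# swa_model.module now holds the averaged weights, which would then be
# used for the final sliding-window evaluation.
```

`AveragedModel.update_parameters` keeps a running arithmetic mean of the snapshots it has seen, so terminating gracefully into this loop yields the averaged weights at no extra training cost.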