val_bpb: 1.4078
Architecture: Transformer
Optimizer: —
Artifact Size: 15.77 MB
Training Techniques
- Quantization: QAT (bits: 6)
- Weight Averaging: SWA
- Evaluation: sliding-window eval
- LR Schedule: custom tuning from multi-device to single-device scale
- Other: replaced torch.quantile with w.abs().amax(dim=1).clamp_min to avoid a 30x compiler performance penalty in Triton
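The quantile-to-absmax swap can be sketched as follows. This is a minimal illustration, not the submission's actual CastedLinear code: it assumes a per-output-row symmetric scale, and the function name `absmax_scale` is hypothetical.

```python
import torch

def absmax_scale(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Per-row scale via abs-max: a single max reduction, which Triton
    # compiles efficiently, unlike torch.quantile (the reported ~30x
    # slowdown). clamp_min guards against division by zero for all-zero rows.
    return w.abs().amax(dim=1, keepdim=True).clamp_min(eps)

w = torch.randn(4, 16)
scale = absmax_scale(w)   # shape (4, 1), one scale per output row
w_normed = w / scale      # normalized values lie in [-1, 1]
```

Note that `amax` over the absolute values is equivalent to `torch.quantile(w.abs(), 1.0, dim=1)`, so for a full (q=1.0) quantile the replacement is exact rather than an approximation.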
Novel Contributions
- Tuned hyperparameters from multi-device to single-device (1x A100) scale to ensure a proper learning rate schedule and fit within the 10-minute training constraint
- Replaced torch.quantile with w.abs().amax(dim=1).clamp_min in CastedLinear to bypass a severe 30x GPU performance penalty caused by the Triton compiler
- Constrained the gradient-accumulation batch to 131K tokens, allowing 2600 iterations over which the LR can fully decay
- Used int6 quantization-aware training (QAT) to reduce artifact size while maintaining performance
- Graceful termination of training into SWA (stochastic weight averaging), followed by a final sliding-window evaluation
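The int6 QAT contribution can be sketched as symmetric fake quantization with a straight-through estimator. This is an illustrative reconstruction under stated assumptions (per-row absmax scaling, symmetric levels); the function name `fake_quant6` and the exact scaling scheme are not from the submission.

```python
import torch

def fake_quant6(w: torch.Tensor) -> torch.Tensor:
    # Symmetric int6 fake quantization: signed 6-bit range is [-31, 31].
    qmax = 2 ** (6 - 1) - 1                              # 31
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax)           # quantize to ints
    wq = q * scale                                       # dequantize
    # Straight-through estimator: forward pass uses the quantized weights,
    # backward pass treats the rounding as identity so gradients still flow.
    return w + (wq - w).detach()

w = torch.randn(64, 128)
wq = fake_quant6(w)  # same shape; each row uses at most 63 distinct levels
```

Training through this fake-quantized forward pass is what lets the final artifact store 6-bit weights (hence the small 15.77 MB size) without a post-training accuracy cliff.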
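The SWA tail can be sketched with PyTorch's built-in `torch.optim.swa_utils.AveragedModel`. This is a hedged sketch, not the submission's code: the toy `nn.Linear` stands in for the actual transformer, and the in-place weight bump stands in for real optimizer steps.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel

# Toy model standing in for the transformer (illustrative only).
model = nn.Linear(8, 8)
swa_model = AveragedModel(model)

# During the final stretch of training (e.g. as the 10-minute budget runs
# out), fold each step's weights into the running SWA average.
for step in range(5):
    with torch.no_grad():
        model.weight.add_(0.01)      # stand-in for an optimizer step
    swa_model.update_parameters(model)

# swa_model.module now holds the averaged weights, which would then be
# used for the final sliding-window evaluation.
```

`AveragedModel.update_parameters` keeps a running arithmetic mean of the snapshots it has seen, so terminating gracefully into this loop yields the averaged weights at no extra training cost.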