PR #707

open

Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v2]

by Shuvam-Banerji-SealView on GitHub
val_bpb
1.4078
Architecture
modded-nanogpt-derived Transformer
Optimizer
Artifact Size
15.77 MB

Training Techniques

Quantization
QAT
bits: 6
scope: all
Weight Averaging
SWA
parameters: null
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: null
LR Schedule
LR scheduling tuned for single-device run
parameters: {"gradient_accum_tokens":131000,"iterations":2600}
Other
other
Replaced torch.quantile with clip-factor estimation based on w.abs().amax(dim=1) and w.amax().clamp_min, avoiding a severe Triton compiler performance penalty
parameters: null
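The torch.quantile replacement above can be sketched as follows. The PR does not give the exact formula, so the way the per-row and global amax values are combined here (a mean of ratios rescaled by the global peak) is an illustrative assumption; only the use of `w.abs().amax(dim=1)` and a `clamp_min` guard is taken from the PR text.

```python
import torch

def clip_factor_quantile(w: torch.Tensor, q: float = 0.999) -> torch.Tensor:
    # Baseline: quantile-based clip threshold. torch.quantile requires a sort,
    # which is the operation the PR reports as slow under the Triton compiler.
    return torch.quantile(w.abs().flatten(), q)

def clip_factor_amax(w: torch.Tensor) -> torch.Tensor:
    # Replacement sketch (combination formula is a guess, not from the PR):
    # estimate the clip factor from per-row absolute maxima relative to the
    # global maximum, using only cheap reduction kernels.
    row_amax = w.abs().amax(dim=1)                 # per-row peak magnitude
    global_amax = w.abs().amax().clamp_min(1e-8)   # guard against division by zero
    return (row_amax / global_amax).mean() * global_amax
```

Both functions return a scalar tensor usable as a clipping bound; the amax variant trades the exact quantile for reductions that compile to fast kernels.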

Novel Contributions

  • Adjusted hyperparameters from multi-device scales to single-A100 scales so the LR schedule completes properly
  • Replaced torch.quantile with amax-based clip factor estimation to avoid a severe Triton compiler performance penalty
  • Reduced gradient accumulation sizing to 131K tokens so training reaches 2600 iterations within the time budget
  • Addressed prior review feedback on unused dependencies and imports
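The 6-bit QAT listed under Training Techniques can be illustrated with a minimal symmetric fake-quantization step. This is a generic sketch of the technique, not code from the submission; the function name and the straight-through-estimator formulation are assumptions.

```python
import torch

def fake_quant_6bit(w: torch.Tensor, clip: torch.Tensor) -> torch.Tensor:
    # Symmetric 6-bit fake quantization: clip, snap to one of the signed
    # 6-bit levels, then dequantize. The straight-through estimator passes
    # gradients through the rounding step unchanged, which is what lets
    # QAT train with non-differentiable quantization in the forward pass.
    qmax = 2 ** (6 - 1) - 1                  # 31: largest signed 6-bit level
    scale = clip / qmax
    w_c = w.clamp(-clip, clip)               # clip to the estimated range
    w_q = (w_c / scale).round() * scale      # quantize-dequantize
    return w_c + (w_q - w_c).detach()        # STE: identity gradient w.r.t. w_c
```

With `scope: all` as in the metadata, such a step would be applied to every weight matrix in the forward pass while the master weights stay in full precision.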