PR #712 (closed)
Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v3]
by Shuvam-Banerji-Seal
val_bpb
1.4078
Architecture
modded-nanogpt-derived Transformer
Optimizer
—
Artifact Size
15.77 MB
Training Techniques
Quantization
QAT
bits: 6
scope: all
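The 6-bit setting above corresponds to quantization-aware training (QAT), where weights are rounded to a low-bit grid on the forward pass but kept in full precision for updates. The sketch below is a minimal, framework-free illustration of symmetric 6-bit fake quantization; only `bits=6` comes from the PR, and the function name and absmax scaling rule are assumptions for illustration (the actual run uses PyTorch tensors).

```python
def fake_quantize(weights, bits=6):
    # Symmetric fake quantization: map each weight onto a signed
    # (2**bits)-level grid via round(w / scale) * scale, so the
    # model trains against the rounding error it will see at inference.
    # NOTE: illustrative sketch, not the PR's actual PyTorch code.
    qmax = 2 ** (bits - 1) - 1            # 31 representable magnitudes for 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid divide-by-zero
    return [round(w / scale) * scale for w in weights]
```

Each output value differs from its input by at most half a quantization step (`scale / 2`), which is what makes training through the rounding tractable.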
Weight Averaging
SWA
parameters: null
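Since the SWA parameters are unrecorded (`null`), here is only the generic mechanism: stochastic weight averaging keeps an equal-weight running mean of model weights sampled along the training trajectory. The incremental-update form below is a minimal sketch (PyTorch provides this as `torch.optim.swa_utils.AveragedModel`); the function name is hypothetical.

```python
def swa_update(avg, new, n):
    # Incremental equal-weight average: `avg` already averages `n`
    # weight snapshots; fold in snapshot `new` without storing history.
    # avg_{n+1} = avg_n + (new - avg_n) / (n + 1)
    return [a + (w - a) / (n + 1) for a, w in zip(avg, new)]
```

Averaged weights typically evaluate better than the final iterate because the mean lands nearer the center of the loss basin.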
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: null
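The eval parameters are likewise unrecorded, so the sketch below shows only the general span bookkeeping behind sliding-window evaluation: the sequence is scored in strides, each stride getting up to a full window of left context, and every token counted exactly once. The function name and the window/stride values in the test are assumptions, not values from the PR.

```python
def sliding_eval_spans(seq_len, window, stride):
    # Produce (ctx_start, end, score_start) triples: tokens in
    # [score_start, end) are scored using context from ctx_start,
    # so each token sees up to `window` tokens of left context and
    # no token is scored twice across spans.
    spans = []
    score_start = 0
    while score_start < seq_len:
        end = min(score_start + stride, seq_len)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, score_start))
        score_start = end
    return spans
```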
LR Schedule
LR scheduling tuned for a single-device run
parameters: {"gradient_accum_tokens":131000,"iterations":2600}
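The `gradient_accum_tokens: 131000` figure determines how many micro-batches are summed before each optimizer step. A minimal sketch of that sizing arithmetic, with hypothetical batch-size and sequence-length values (only the 131K token target comes from the PR):

```python
def accum_steps(target_tokens, micro_batch_size, seq_len):
    # Gradients from this many micro-batches are accumulated before
    # one optimizer step, so each step sees ~target_tokens tokens.
    tokens_per_micro = micro_batch_size * seq_len
    return max(1, round(target_tokens / tokens_per_micro))
```

With 2600 optimizer iterations at ~131K tokens each, the run covers roughly 340M training tokens on the single A100.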
Other
other
Replaced torch.quantile with w.amax().clamp_min / w.abs().amax(dim=1) to avoid a Triton compiler performance penalty
parameters: null
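The motivation for the swap above is that `torch.quantile` needs a sort-like reduction that compiles poorly under Triton, while an absolute-max reduction is a single cheap pass. A minimal stdlib sketch of the amax-style per-row scale (the PR's actual ops are `w.abs().amax(dim=1)` and `amax().clamp_min` on PyTorch tensors; the function name and epsilon here are assumptions):

```python
def absmax_scale_per_row(w, eps=1e-8):
    # Per-row absolute maximum, clamped away from zero -- a cheap
    # clip threshold standing in for a quantile-based one.
    # Mirrors w.abs().amax(dim=1).clamp_min(eps) in spirit.
    return [max(eps, max(abs(x) for x in row)) for row in w]
```

Unlike a quantile threshold, absmax never clips any weight, so the scale is slightly looser but far cheaper to compute every step.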
other
Fixed bigram embeddings validation when size < 2
parameters: null
Novel Contributions
- Tuned hyperparameters down from multi-device scales for a single A100 run
- Replaced torch.quantile with amax-based clipping to avoid a severe Triton compiler performance penalty
- Adjusted gradient accumulation sizing to 131K tokens so the run completes the intended training iterations
- Added validation handling for bigram embeddings when size < 2
- Cleaned up unused dependencies/imports and corrected compressor variable logging