PR #725

open

Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v5]

by Shuvam-Banerji-SealView on GitHub

val_bpb

1.5252

Architecture

modded-nanogpt-derived Transformer

Optimizer

—

Artifact Size

15.77 MB

Training Techniques

Quantization

QAT

bits: 6

scope: all

Architecture

CastedLinear clip factor estimator

Replaces torch.quantile with a per-row absolute maximum, w.abs().amax(dim=1).clamp_min, which makes clip factor estimation faster and avoids a Triton compilation slowdown.

parameters: null
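
As a rough illustration of the clip factor change described above, the sketch below computes a per-row absolute maximum with a lower bound and uses it as the scale for 6-bit fake quantization. It is plain Python so that it runs without PyTorch. The function names, the `eps` floor, and the symmetric rounding scheme are assumptions made for illustration, not details taken from the PR.

```python
# Sketch only: pure-Python stand-in for a per-row abs-max clip factor.
# `row_clip_factors`, `fake_quantize`, and `eps` are hypothetical names.

def row_clip_factors(weight, eps=1e-8):
    """Return one clip value per row: max |w| over the row, floored at eps."""
    return [max(max(abs(v) for v in row), eps) for row in weight]


def fake_quantize(weight, bits=6, eps=1e-8):
    """Quantize then dequantize each row on a symmetric `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits (assumed symmetric range)
    out = []
    for row, clip in zip(weight, row_clip_factors(weight, eps)):
        scale = clip / qmax  # step size between adjacent grid points
        out.append([round(v / scale) * scale for v in row])
    return out


w = [[0.5, -2.0, 1.0], [0.0, 0.0, 0.0]]
clips = row_clip_factors(w)   # the all-zero row falls back to eps
wq = fake_quantize(w)
```

The floor matters for the second row: without it, an all-zero row would give a zero scale and a division by zero. Unlike a quantile, the maximum needs no sorting, at the cost of being sensitive to a single outlier weight in the row.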

bigram embedding guard

Adds a guard for small-vocab edge cases in the bigram embedding path.

parameters: null

Other

training log labels

Makes compressor-dependent labels and final-roundtrip labels explicit in training logs.

parameters: null

Sequence Length

sequence_length

train_length: 131000

eval_length: null

Weight Averaging

SWA

parameters: null
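
The entry above gives no SWA parameters, so the following is only a generic sketch of weight averaging as a running mean over parameter snapshots. Which checkpoints the PR averages, and how often, is not stated; the `update_average` helper and the toy snapshots are hypothetical.

```python
# Sketch only: running mean over flattened parameter snapshots.
# `update_average` is a hypothetical helper, not code from the PR.

def update_average(avg, new, n_averaged):
    """Fold one more snapshot into the mean of `n_averaged` earlier ones."""
    return [a + (x - a) / (n_averaged + 1) for a, x in zip(avg, new)]


snapshots = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]  # toy checkpoints
avg = list(snapshots[0])
for n, snap in enumerate(snapshots[1:], start=1):
    avg = update_average(avg, snap, n)
```

The incremental form keeps only one extra copy of the weights in memory, regardless of how many snapshots are folded in.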

Evaluation

sliding window eval

parameters: null
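
No window or stride is listed for the sliding window eval, so the sketch below only shows the general idea: overlapping windows advance by a fixed stride, and each window scores just the tokens that no earlier window has scored, so later tokens are evaluated with more left context. The function name and the example sizes are assumptions.

```python
# Sketch only: span bookkeeping for a sliding-window evaluation.
# `sliding_windows` and the sizes used below are hypothetical.

def sliding_windows(n_tokens, window, stride):
    """Return (start, end, score_from) triples covering every token once.

    Tokens in [start, score_from) serve as context only; tokens in
    [score_from, end) contribute to the loss.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(n_tokens=10, window=4, stride=2)
```

A smaller stride gives each scored token more context but increases the number of forward passes roughly in proportion to window / stride.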

Compression

zstd

level: null

Tuned QAT hyperparameters on a single A100 to fit within the wallclock cap
Replaced torch.quantile with w.abs().amax(dim=1).clamp_min to avoid a large Triton compilation slowdown
Added a guard for small-vocab bigram embedding edge cases
Made compressor-dependent and final-roundtrip labels explicit in training logs
Reported the final submission metric from the post-export sliding-window roundtrip evaluation
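
For reference, bits per byte is conventionally the summed negative log-likelihood converted from nats to bits and divided by the number of bytes in the evaluated text. The sketch below shows that conversion only; how the PR accumulates loss and counts bytes is not shown here, and the function name and toy numbers are assumptions.

```python
import math

# Sketch only: conventional nats-to-BPB conversion.
# `bits_per_byte` and the toy inputs are hypothetical.

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats into bits per byte of raw text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Toy input: 300 bits of total loss spread over 200 bytes of text.
bpb = bits_per_byte(total_nll_nats=math.log(2) * 300, total_bytes=200)
```

Dividing by bytes rather than tokens keeps the metric comparable across tokenizers with different vocabularies.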