PR #719

closed

Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v4]

by Shuvam-Banerji-SealView on GitHub

val_bpb

1.5252
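
For readers unfamiliar with the metric, the sketch below shows the usual way a bits-per-byte figure is derived from a summed cross-entropy loss. The PR does not show its own formula, so the function name and the example numbers are assumptions used only for illustration.

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) to bits per byte of text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes by text size.
    return total_loss_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: 1000 tokens at a mean loss of 2.0 nats, covering 4000 bytes.
example = bits_per_byte(total_loss_nats=2.0 * 1000, total_bytes=4000)
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers.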

Architecture

modded-nanogpt-derived Transformer

Optimizer

—

Artifact Size

15.77 MB

Training Techniques

Quantization

QAT

bits: 6

scope: all
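
As an illustration of what 6-bit quantization does to a weight row, here is a minimal plain-Python sketch of the quantize-then-dequantize step. The symmetric range [-31, 31] and the function name are assumptions; the PR's training code applies this to tensors and is not shown here. During QAT the rounding would run in the forward pass, with gradients typically passed straight through.

```python
def fake_quantize_row(row, scale, bits=6):
    """Round floats to signed `bits`-bit integer codes, then map the codes back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits (symmetric range, an assumption)
    # Round to the nearest code and clip anything outside [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    dequantized = [c * scale for c in codes]
    return codes, dequantized

scale = 1.0 / 31
codes, dequantized = fake_quantize_row([-1.0, 0.3, 0.0, 0.62, 2.0], scale)
```

Values inside the representable range move by at most half a scale step; the final value, 2.0, lies outside it and is clipped to the largest code.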

Evaluation

sliding window eval

parameters: null
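
Sliding-window evaluation can be sketched as follows. The window and stride values are not given in the PR (parameters are null), so the numbers and the helper name below are hypothetical. Each span re-reads earlier tokens as context but scores only the tokens that no previous window has scored.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end, score_from) spans; tokens in [score_from, end) are scored."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    spans = []
    scored_up_to = 0
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        # Tokens before scored_up_to serve only as context in this window.
        spans.append((start, end, scored_up_to))
        scored_up_to = end
        start += stride
    return spans

spans = sliding_windows(num_tokens=10, window=4, stride=2)
```

With these hypothetical settings every token is scored exactly once, and all tokens after the first window are scored with at least two tokens of preceding context.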

Weight Averaging

SWA

parameters: null
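
A minimal sketch of the idea behind SWA (stochastic weight averaging): keep an equal-weight running mean of checkpoints. Which checkpoints the PR averages, and how often, is not stated, so the three toy "checkpoints" and the helper name below are assumptions.

```python
def update_average(avg, weights, num_averaged):
    """Fold one more checkpoint into an equal-weight running average."""
    if avg is None:
        return list(weights), 1
    n = num_averaged
    # Incremental mean: new_avg = avg + (w - avg) / (n + 1)
    new_avg = [a + (w - a) / (n + 1) for a, w in zip(avg, weights)]
    return new_avg, n + 1

avg, count = None, 0
for checkpoint in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    avg, count = update_average(avg, checkpoint, count)
```

The incremental form avoids storing every checkpoint: only the running average and a counter are kept.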

Compression

zstd

level: null
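
The following sketch illustrates a compress-and-restore roundtrip of quantized codes. It uses the standard-library zlib as a stand-in for zstd so that it runs without extra dependencies. The one-byte-per-code packing and the toy data are assumptions, not the PR's export format.

```python
import struct
import zlib

def pack_codes(codes):
    """Store signed integer codes as one byte each (int8)."""
    return struct.pack(f"{len(codes)}b", *codes)

def unpack_codes(blob):
    """Inverse of pack_codes."""
    return list(struct.unpack(f"{len(blob)}b", blob))

codes = [-31, 9, 0, 19] * 256            # toy 6-bit codes, highly repetitive
raw = pack_codes(codes)
compressed = zlib.compress(raw, 9)       # zlib here; the PR lists zstd
restored = unpack_codes(zlib.decompress(compressed))
```

Evaluating the model rebuilt from `restored`, rather than the in-memory weights, is what makes a metric a roundtrip metric: it reflects exactly what the shipped artifact contains.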

Sequence Length

sequence_length

train_length: 131000

eval_length: null

LR Schedule

standard LR scheduling tuned for single-device run

parameters: {"scaled_down_from_multi_device":true}

Other

other

Replaced torch.quantile with a per-row w.abs().amax(dim=1) followed by clamp_min, to avoid a Triton compilation slowdown

parameters: null
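
A plain-Python analogue of the per-row scale computation may help clarify the change. The source names `w.abs().amax(dim=1).clamp_min` without its argument, so the floor value `1e-8`, the division by the largest 6-bit code, and the function name are assumptions.

```python
def row_scales(weight, bits=6, floor=1e-8):
    """Per-row scale from the largest absolute value, floored away from zero."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    # max(...) over a row mirrors amax(dim=1); the outer max(..., floor) mirrors clamp_min.
    return [max(max(abs(x) for x in row), floor) / qmax for row in weight]

weight = [[0.5, -3.1, 0.0], [0.0, 0.0, 0.0]]
scales = row_scales(weight)
```

Unlike a quantile, a maximum needs no sorting or interpolation, and the floor keeps an all-zero row from producing a zero scale.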

other

Made compressor-dependent labels and final-roundtrip labels explicit in training logs

parameters: null

Architecture

bigram embedding guard

Added guard for small-vocab edge cases in the bigram embedding path

parameters: null
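
The PR does not show the guard itself. The sketch below is a hypothetical example of the kind of edge case such a guard might cover: a hashed bigram table whose modulus becomes zero when the table is too small. The hash constant, the reserved last bucket, and the fallback to bucket 0 are all assumptions.

```python
def bigram_bucket_ids(tokens, num_buckets):
    """Hash each (previous, current) token pair into a bucket id.

    The last bucket is reserved for the first position, which has no predecessor.
    """
    if num_buckets < 2:
        # Guard: the modulus below would be zero, so every position shares bucket 0.
        return [0] * len(tokens)
    mod = num_buckets - 1
    ids = [mod] if tokens else []
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append((prev * 31 + cur) % mod)  # 31 is an arbitrary illustrative multiplier
    return ids

normal = bigram_bucket_ids([1, 2, 3], num_buckets=8)
guarded = bigram_bucket_ids([1, 2, 3], num_buckets=1)
```

Without the early return, the degenerate table size would raise a division-by-zero error instead of falling back to a single shared bucket.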

Scaled hyperparameters down from their multi-device values for a single-A100 run so that LR scheduling still behaves as intended
Replaced torch.quantile with a per-row w.abs().amax(dim=1) followed by clamp_min, to avoid a large Triton compilation slowdown
Added a guard for small-vocab edge cases in the bigram embedding path
Made compressor-dependent labels and final-roundtrip labels explicit in training logs
Used the final post-export sliding-window roundtrip metric as the reported submission val_bpb