PR #1032 (open)

[Non-Record] QAT Dead-Code Analysis + 7 Novel Technique Sweep (1xH100)

val_bpb: 1.3631
Architecture: Transformer
Optimizer:
Artifact Size:
Training Techniques

Quantization
  • late QAT (bits: 6, scope: all)
  • STE QAT (bits: 6, scope: all)
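The listed STE QAT uses a single per-tensor scale (the contributions below note this is what avoids torch.compile recompilation). A minimal numpy sketch of symmetric int6 fake quantization, with a hypothetical helper name; in actual PyTorch training code the straight-through estimator is typically written as `w + (quant(w) - w).detach()` so gradients pass through the rounding as identity:

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization (illustrative, not the PR's code).

    One tensor-wide scale keeps the graph shape static, which is the property
    that lets a tensor-scale STE QAT fix avoid recompilation.
    """
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

At `bits=6` every weight is snapped to one of 64 levels spaced `scale` apart; the largest-magnitude weight is reproduced exactly.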
Regularization
  • magnitude pruning (fraction: 0.05)
  • LN scale (formula: 1/sqrt(L+1))
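The magnitude-pruning setting above zeroes the smallest 5% of weights; the contributions below pair it with quantization as a prune-then-quantize toggle. A small sketch with a hypothetical helper name, assuming a global (whole-tensor) threshold:

```python
import numpy as np

def magnitude_prune(w, fraction=0.05):
    """Zero the smallest-magnitude `fraction` of entries (illustrative).

    Matches the listed {"fraction": 0.05}. Ties at the threshold may prune
    slightly more than the exact fraction.
    """
    k = int(round(fraction * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```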
Optimizer
  • Muon (variant: Muon-VS; weight_decay: null; momentum: null)
  • AdamW (weight_decay: 0.04; momentum: null)
Weight Averaging
  • EMA (decay: 0.997)
  • SWA (parameters: null)
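The EMA entry above corresponds to the standard exponential moving average of weights; a one-step sketch under the listed decay of 0.997 (parameter names here are illustrative):

```python
def ema_update(ema, params, decay=0.997):
    # One EMA step per training iteration:
    #   ema <- decay * ema + (1 - decay) * param
    # decay=0.997 is the value listed above.
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

Evaluation then runs on the `ema` copy rather than the live weights.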
Architecture
  • LeakyReLU (slope: 0.75): changed the activation slope to 0.75; reported as better than the 0.5^2 variant.
  • depth recurrence (layers: 2, steps: 2): "Thinking Deeper" recurrence applied to the model.
  • anti-layer removal: layer-ablation diagnostic that zeroes attn_scale and mlp_scale one layer at a time.
  • MLP width (model_dim: 576, layers: 11): wider model variant.
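The depth-recurrence entry (layers: 2, steps: 2) reuses one block of layers multiple times, raising effective depth without adding parameters. A minimal sketch, assuming a plain sequential block (function names are illustrative):

```python
def depth_recurrent_forward(x, block, steps=2):
    # Weight-shared depth recurrence: run the same stack of layers `steps`
    # times. With layers=2 and steps=2 as listed, a 2-layer block is
    # unrolled twice, giving effective depth 4 at the parameter cost of 2.
    for _ in range(steps):
        for layer in block:
            x = layer(x)
    return x
```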
Initialization
  • spectral init: spectral compression variant using SVD-based factorization.
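One way to realize an SVD-based factorization like the spectral init above is to split a weight matrix into two thin factors along its top singular directions; a hedged numpy sketch (the helper name and rank handling are assumptions, not the PR's code):

```python
import numpy as np

def svd_factorize(w, rank):
    # Truncated-SVD factorization of a (d_out, d_in) weight matrix into two
    # smaller factors. At full rank the product reconstructs w exactly;
    # a smaller `rank` compresses it.
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]    # (d_out, rank)
    b = vt[:rank, :]              # (rank, d_in)
    return a, b
```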

Novel Contributions

  • Confirmed a torch.compile dead-code bug causing late QAT to be eliminated in #315-derived code
  • Implemented a working tensor-scale STE QAT fix that avoids recompilation
  • Showed that fixing QAT worsened int6 validation bpb on this 1xH100 setup
  • Swept seven previously untried techniques on the SOTA stack, all negative
  • Quantified the throughput tax: about 0.007 bpb per 1 ms of overhead at this step budget
  • Added working prune-then-quantize and anti-layer diagnostic toggles
  • Observed that zero-overhead changes like LeakyReLU slope tuning are the only ones that survived the budget
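The anti-layer diagnostic mentioned above zeroes a layer's attn_scale and mlp_scale so that layer's residual contributions drop out without touching its weights. A sketch of the toggle, assuming a simple per-layer dict of scale gains (the data layout is illustrative, not the PR's actual structure):

```python
import copy

def ablate_layer(scales, layer_idx):
    # Anti-layer diagnostic: zero one layer's residual-branch gains so its
    # attention and MLP outputs are removed while the architecture and
    # weights stay intact. Re-run validation per layer to find dead or
    # harmful layers.
    out = copy.deepcopy(scales)
    out[layer_idx]["attn_scale"] = 0.0
    out[layer_idx]["mlp_scale"] = 0.0
    return out
```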