PR #1032 (open)

[Non-Record] QAT Dead-Code Analysis + 7 Novel Technique Sweep (1xH100)

val_bpb: 1.3631
Architecture: Transformer
Optimizer:
Artifact Size:
Training Techniques

Quantization
  • late QAT (bits: 6, scope: all)
  • STE QAT (bits: 6, scope: all)
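The listed STE QAT uses a single per-tensor scale (the contributions below note this is what avoids torch.compile recompilation). A minimal numpy sketch of symmetric int6 fake quantization, with a hypothetical helper name; in actual PyTorch training code the straight-through estimator is typically written as `w + (quant(w) - w).detach()` so gradients pass through the rounding as identity:

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization (illustrative, not the PR's code).

    One tensor-wide scale keeps the graph shape static, which is the property
    that lets a tensor-scale STE QAT fix avoid recompilation.
    """
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

At `bits=6` every weight is snapped to one of 64 levels spaced `scale` apart; the largest-magnitude weight is reproduced exactly.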
Regularization
  • magnitude pruning (fraction: 0.05)
  • LN scale (formula: 1/sqrt(L+1))
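The magnitude-pruning setting above zeroes the smallest 5% of weights; the contributions below pair it with quantization as a prune-then-quantize toggle. A small sketch with a hypothetical helper name, assuming a global (whole-tensor) threshold:

```python
import numpy as np

def magnitude_prune(w, fraction=0.05):
    """Zero the smallest-magnitude `fraction` of entries (illustrative).

    Matches the listed {"fraction": 0.05}. Ties at the threshold may prune
    slightly more than the exact fraction.
    """
    k = int(round(fraction * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```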
Optimizer
  • Muon (variant: Muon-VS; weight_decay: null; momentum: null)
  • AdamW (weight_decay: 0.04; momentum: null)
Weight Averaging
  • EMA (decay: 0.997)
  • SWA (parameters: null)
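The EMA entry above corresponds to the standard exponential moving average of weights; a one-step sketch under the listed decay of 0.997 (parameter names here are illustrative):

```python
def ema_update(ema, params, decay=0.997):
    # One EMA step per training iteration:
    #   ema <- decay * ema + (1 - decay) * param
    # decay=0.997 is the value listed above.
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

Evaluation then runs on the `ema` copy rather than the live weights.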
Architecture
  • LeakyReLU (slope: 0.75): changed the activation slope to 0.75; reported as better than the 0.5^2 variant.
  • depth recurrence (layers: 2, steps: 2): "Thinking Deeper" recurrence applied to the model.
  • anti-layer removal: layer-ablation diagnostic that zeroes attn_scale and mlp_scale one layer at a time.
  • MLP width (model_dim: 576, layers: 11): wider model variant.
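The depth-recurrence entry (layers: 2, steps: 2) reuses one block of layers multiple times, raising effective depth without adding parameters. A minimal sketch, assuming a plain sequential block (function names are illustrative):

```python
def depth_recurrent_forward(x, block, steps=2):
    # Weight-shared depth recurrence: run the same stack of layers `steps`
    # times. With layers=2 and steps=2 as listed, a 2-layer block is
    # unrolled twice, giving effective depth 4 at the parameter cost of 2.
    for _ in range(steps):
        for layer in block:
            x = layer(x)
    return x
```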
Initialization
  • spectral init: spectral compression variant using SVD-based factorization.
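One way to realize an SVD-based factorization like the spectral init above is to split a weight matrix into two thin factors along its top singular directions; a hedged numpy sketch (the helper name and rank handling are assumptions, not the PR's code):

```python
import numpy as np

def svd_factorize(w, rank):
    # Truncated-SVD factorization of a (d_out, d_in) weight matrix into two
    # smaller factors. At full rank the product reconstructs w exactly;
    # a smaller `rank` compresses it.
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]    # (d_out, rank)
    b = vt[:rank, :]              # (rank, d_in)
    return a, b
```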

Novel Contributions

  • Confirmed a torch.compile dead-code bug causing late QAT to be eliminated in #315-derived code
  • Implemented a working tensor-scale STE QAT fix that avoids recompilation
  • Showed that fixing QAT worsened int6 validation bpb on this 1xH100 setup
  • Swept seven previously untried techniques on the SOTA stack, all negative
  • Quantified the throughput tax: about 0.007 bpb per 1 ms of overhead at this step budget
  • Added working prune-then-quantize and anti-layer diagnostic toggles
  • Observed that zero-overhead changes like LeakyReLU slope tuning are the only ones that survived the budget
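The anti-layer diagnostic mentioned above zeroes a layer's attn_scale and mlp_scale so that layer's residual contributions drop out without touching its weights. A sketch of the toggle, assuming a simple per-layer dict of scale gains (the data layout is illustrative, not the PR's actual structure):

```python
import copy

def ablate_layer(scales, layer_idx):
    # Anti-layer diagnostic: zero one layer's residual-branch gains so its
    # attention and MLP outputs are removed while the architecture and
    # weights stay intact. Re-run validation per layer to find dead or
    # harmful layers.
    out = copy.deepcopy(scales)
    out[layer_idx]["attn_scale"] = 0.0
    out[layer_idx]["mlp_scale"] = 0.0
    return out
```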