PR #2054

open

Predicted val_bpb ~1.054 on PR #2014 base — Gated XSA + Reverse-Chol GPTQ + Leaky 0.3 stack (code complete, asking for compute to verify)

by anderamondarainh-stack
val_bpb
1.0540 (predicted, not yet verified)
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,999,891 bytes

Training Techniques

Architecture
XSA
Added per-head gated subtraction with tanh(alpha) gating and zero-init alpha for additive behavior at step 0.
parameters: null
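
The gating described for XSA can be illustrated with a small NumPy sketch. This is one reading of the description, not the PR's code: the function name `gated_subtract`, the tensor shapes, and the placeholder `sub_term` (whatever quantity XSA subtracts, which this summary does not specify) are all assumptions. It shows that with alpha initialized to zero the gate is tanh(0) = 0, so the subtraction path contributes nothing at step 0.

```python
import numpy as np

def gated_subtract(attn_out, sub_term, alpha):
    """Subtract a per-head gated term from the attention output.

    attn_out, sub_term: arrays of shape (heads, tokens, dim) (assumed layout).
    alpha: array of shape (heads,), one learnable gate logit per head.
    """
    gate = np.tanh(alpha)[:, None, None]  # per-head scalar in (-1, 1)
    return attn_out - gate * sub_term

heads, tokens, dim = 4, 3, 8
rng = np.random.default_rng(0)
attn_out = rng.standard_normal((heads, tokens, dim))
sub_term = rng.standard_normal((heads, tokens, dim))  # hypothetical subtracted term

alpha_init = np.zeros(heads)                 # zero-init: gate is exactly 0
out_step0 = gated_subtract(attn_out, sub_term, alpha_init)

alpha_later = np.full(heads, 0.5)            # illustrative value after some training
out_later = gated_subtract(attn_out, sub_term, alpha_later)
```

Because tanh is bounded, the gate can never scale the subtracted term by more than 1 in magnitude, and it can also change sign if a head learns a negative alpha.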
LeakyReLU
Tightened LeakyReLU slope from 0.5 to 0.3 in both Python and fused Triton kernel.
parameters: {"slope":0.3}
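
For reference, a minimal sketch of what the slope change means numerically. The constant name `LEAKY_SLOPE` and the scalar helper are hypothetical; the PR applies the change in its own Python path and fused Triton kernel, which are not shown here.

```python
LEAKY_SLOPE = 0.3  # hypothetical shared constant; the previous value was 0.5

def leaky_relu(x, slope=LEAKY_SLOPE):
    """Scalar LeakyReLU: identity for x >= 0, slope * x otherwise."""
    return x if x >= 0.0 else slope * x

neg_new = leaky_relu(-2.0)             # new slope, negative input
neg_old = leaky_relu(-2.0, slope=0.5)  # old slope, same input
pos = leaky_relu(2.0)                  # positive inputs are unaffected
```

Since the summary says the slope was changed in two places, keeping both implementations tied to one value avoids a mismatch between the reference and fused paths.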
Quantization
GPTQ
bits: null
scope: calibration Hessians
Other
other
All-rank Hessian averaging via distributed all-reduce before normalization to reduce calibration noise.
parameters: null
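
The idea of reducing across ranks before normalizing can be sketched as a single-process simulation. Everything here is an assumption for illustration: the helper names, the use of X^T X as the Hessian proxy (constant factors omitted), and the Python `sum` standing in for a distributed sum all-reduce such as `torch.distributed.all_reduce`.

```python
import numpy as np

def local_hessian_stats(X):
    """Unnormalized Hessian proxy and sample count for one rank.

    X: (n_samples, in_features) calibration activations seen by that rank.
    """
    return X.T @ X, X.shape[0]

def all_rank_average(per_rank_stats):
    """Sum Hessians and counts over ranks, then normalize once."""
    H_sum = sum(H for H, _ in per_rank_stats)   # stand-in for all-reduce(SUM)
    n_sum = sum(n for _, n in per_rank_stats)   # stand-in for all-reduce(SUM)
    return H_sum / n_sum

rng = np.random.default_rng(0)
# Four simulated ranks with different numbers of calibration rows.
shards = [rng.standard_normal((n, 5)) for n in (16, 24, 8, 32)]
H_avg = all_rank_average([local_hessian_stats(X) for X in shards])

# Same quantity computed as if one process had seen all the data.
X_all = np.concatenate(shards, axis=0)
H_single = X_all.T @ X_all / X_all.shape[0]
```

Summing first and dividing by the global sample count makes the result match the single-process Hessian even when ranks hold different numbers of samples, so every rank quantizes against the same, less noisy estimate.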
other
Reverse-Cholesky computation of Hinv using flipped matrix Cholesky and triangular solve for faster GPTQ inversion.
parameters: null
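
A sketch of the flipped-Cholesky identity follows, assuming the goal is the upper-triangular factor U with H^{-1} = U^T U that GPTQ implementations commonly use. If P reverses row and column order and P H P = L L^T, then R = P L P is upper triangular with H = R R^T, so U = R^{-1}. The function names are hypothetical, and `np.linalg.solve` is a general solver used here only because NumPy has no triangular solve; a real implementation would use a dedicated triangular solve.

```python
import numpy as np

def upper_factor_reference(H):
    """Conventional route: invert H explicitly, then factor the inverse."""
    return np.linalg.cholesky(np.linalg.inv(H)).T

def upper_factor_reverse_cholesky(H):
    """Same factor from one Cholesky of the flipped matrix plus one solve."""
    n = H.shape[0]
    L = np.linalg.cholesky(H[::-1, ::-1])  # flipped H = L @ L.T, L lower triangular
    R = L[::-1, ::-1]                      # flipping back gives upper R with H = R @ R.T
    U = np.linalg.solve(R, np.eye(n))      # U = R^{-1}; triangular solve in practice
    return np.triu(U)                      # clear any round-off below the diagonal

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 6))
H = A.T @ A + 0.01 * np.eye(6)             # symmetric positive definite test matrix

U = upper_factor_reverse_cholesky(H)
U_ref = upper_factor_reference(H)
```

The two routes agree because the Cholesky factor with positive diagonal is unique; the reversed route avoids forming H^{-1} as a dense intermediate.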

Novel Contributions

  • Gated XSA with zero-initialized tanh gating on the subtraction path
  • LeakyReLU slope tightened from 0.5 to 0.3 in both Python and Triton kernel
  • Distributed all-rank Hessian averaging for GPTQ calibration
  • Reverse-Cholesky Hinv computation to speed up GPTQ inversion
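  • Predicted val_bpb of ~1.054 on the PR #2014 base, assuming the gains from the changes above are additive (code complete, not yet verified with compute)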