PR #670

open

Non-record: Negative results — hardware alignment & quantization on 8xH100

by abaybektursun
val_bpb: 1.1171
Architecture: 11-layer, d=512 Transformer
Optimizer: Parallel Muon
Artifact Size: 16 MB

Training Techniques

Optimizer
  • Parallel Muon

Quantization
  • GPTQ (scope: all)
  • SpinQuant/Hadamard (scope: all)
  • mixed int5/int8 (bits: 5, scope: per-layer)
  • STE QAT (scope: all)
  • Soft-Round QAT (scope: all)
  • selective pruning (scope: all)
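To make the STE QAT entry above concrete, here is a minimal sketch of the forward pass of straight-through-estimator quantization-aware training: weights are rounded onto a symmetric signed b-bit grid, while during training the backward pass would treat the rounding as the identity. The function name, the per-tensor symmetric scheme, and the example values are illustrative assumptions, not the PR's implementation.

```python
def fake_quant(x, bits=5, scale=None):
    """Symmetric per-tensor fake quantization onto a signed b-bit grid.

    This is the forward pass of STE QAT; in training, the backward
    pass would pass gradients straight through the rounding step.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 15 for int5
    if scale is None:
        # Per-tensor scale from the max magnitude (1.0 guards all-zero input).
        scale = max(abs(v) for v in x) / qmax or 1.0
    # Round to the grid, clamp, then map back to float.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return [qi * scale for qi in q]

# Hypothetical weight values for illustration.
w = [0.031, -0.27, 0.9, -1.0]
wq = fake_quant(w, bits=5)
```

The quantized tensor stays in float but only takes 2^bits distinct values, which is what makes the int5 artifact size achievable after export.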
Architecture
  • XSA: applied to all 11 layers instead of only the last 4 (parameters: {"layers":11})
  • VRL: Value Residual Learning to inject identity information into deep attention layers
  • Gated Attention: per-head sigmoid gating in attention
  • QKV fusion: fused 8Q/4KV grouped-query attention projection (parameters: {"q_heads":8,"kv_heads":4})
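The fused 8Q/4KV projection can be sketched as a single GEMM whose output is then split into query and shared key/value heads. This is a numpy shape sketch under the stated parameters, assuming head_dim = 512 // 8 = 64 (an assumption; the PR's actual layout may differ):

```python
import numpy as np

# Dims from the entry above; head_dim = 64 is an assumed derivation.
d, n_q, n_kv, hd = 512, 8, 4, 64

# One fused weight instead of separate Q, K, V projections:
# output width = Q (8*64) + K (4*64) + V (4*64) = 1024.
w_qkv = np.zeros((d, n_q * hd + 2 * n_kv * hd))

x = np.zeros((3, d))                 # 3 tokens of width d
qkv = x @ w_qkv                      # a single GEMM per layer
q, k, v = np.split(qkv, [n_q * hd, n_q * hd + n_kv * hd], axis=-1)

q = q.reshape(3, n_q, hd)            # (3, 8, 64)
k = k.reshape(3, n_kv, hd)           # (3, 4, 64): each KV head serves 2 Q heads
v = v.reshape(3, n_kv, hd)
```

Fusing the three projections replaces three smaller GEMMs with one larger one, which is the usual motivation on hardware where kernel-launch and small-K overheads dominate.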
Regularization
  • weight decay (weight_decay: 0.08)
  • weight decay (weight_decay: 0.04)

Test-Time Training
  • score-first TTT (experiments: 22)
Other
  • torch.compile-based kernel fusion and hardware-aligned optimization attempts, including CUTLASS SM90, fused Triton GEMM, FP8 training, custom CUDA kernels, fused norm+residual, and stale-process mitigation

Novel Contributions

  • Systematic negative-results study of 30+ optimization experiments on an 8xH100 setup
  • Demonstration that torch.compile (PyTorch 2.9.1) already fuses most relevant patterns
  • Evidence that cuBLAS is near the hardware limit for K=512 in this setting
  • Finding that quantization quality matters more than kernel engineering for this competition
  • Comparison of SpinQuant/Hadamard, mixed int5/int8, Soft-Round QAT, and selective pruning
  • Evaluation of architecture changes such as XSA-all, VRL, Gated Attention, larger models, batch size changes, and shard ordering
  • Observation that stale nohup+torchrun processes can silently degrade performance
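The claim that cuBLAS is near the hardware limit for K=512 is consistent with a back-of-the-envelope roofline check. Using approximate public H100 SXM figures (~989 BF16 TFLOP/s dense, ~3.35 TB/s HBM3 — assumptions, not measurements from this PR) and hypothetical large M, N, the arithmetic intensity of a K=512 GEMM sits above the machine's ridge point, so the kernel is compute-bound and a hand-written replacement has little headroom:

```python
# Approximate H100 SXM peaks (public spec numbers, not from the PR).
PEAK_FLOPS = 989e12   # ~BF16 dense FLOP/s
PEAK_BW = 3.35e12     # ~HBM3 bytes/s

def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of an MxK @ KxN GEMM in bf16."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

ridge = PEAK_FLOPS / PEAK_BW           # ~295 FLOP/byte ridge point
ai = gemm_intensity(8192, 8192, 512)   # ~455 FLOP/byte at K=512
compute_bound = ai > ridge             # above the ridge: compute-bound
```

Since the intensity exceeds the ridge point, the GEMM's ceiling is the tensor-core peak rather than memory bandwidth, which matches the finding that CUTLASS and Triton rewrites could not beat cuBLAS here.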