PR #670 (open)
Non-record: Negative results — hardware alignment & quantization on 8xH100
by abaybektursun
val_bpb: 1.1171
Architecture: 11-layer, d=512 Transformer
Optimizer: Parallel Muon
Artifact size: 16 MB
Training Techniques

Optimizer
- Parallel Muon (weight_decay, momentum, and other hyperparameters not reported)
Quantization
- GPTQ (bits not reported; scope: all)
- SpinQuant/Hadamard (bits not reported; scope: all)
- Mixed int5/int8 (bits: 5; scope: per-layer)
- STE QAT (bits not reported; scope: all)
- Soft-Round QAT (bits not reported; scope: all)
- Selective pruning (scope: all)
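The rotation-then-quantize idea behind the SpinQuant/Hadamard experiments can be sketched in a few lines. Everything below (the toy weight row, the int5 width, the 8-dim size) is illustrative rather than the PR's actual code: an orthonormal Hadamard rotation spreads an outlier's energy across coordinates, so symmetric round-to-nearest quantization wastes less of its range on the outlier.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_symmetric(x: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest symmetric quantization to signed `bits`-bit integers,
    returned in dequantized (float) form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

# A weight row with one large outlier, which dominates the quantization scale:
# the small entries all round to zero in the direct int5 case.
w = np.zeros(8)
w[0] = 10.0
w[1:] = 0.1

direct_err = np.abs(quantize_symmetric(w, 5) - w).max()

# Rotate, quantize in the rotated basis, rotate back.
H = hadamard(8)
restored = H.T @ quantize_symmetric(H @ w, 5)
rotated_err = np.abs(restored - w).max()
# rotated_err comes out smaller than direct_err: the rotation spread the
# outlier's energy, so the shared scale suits all coordinates better.
```

The same orthogonality is why the rotation is free at inference time: it can be folded into adjacent weight matrices.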
Architecture
- XSA: applied to all 11 layers instead of only the last 4 (layers: 11)
- VRL: Value Residual Learning to inject identity information into deep attention layers
- Gated Attention: per-head sigmoid gating in attention
- QKV fusion: fused 8Q/4KV grouped-query attention projection (q_heads: 8, kv_heads: 4)
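The fused 8Q/4KV projection and the per-head sigmoid gating listed above can be illustrated together. Only d_model=512 and the 8-query/4-KV head split come from the PR summary; the head dimension, weight initialization, and gate parameterization below are hypothetical.

```python
import numpy as np

# d_model=512 and the 8Q/4KV split are from the PR; head_dim=64 is assumed.
d_model, n_q, n_kv, d_head = 512, 8, 4, 64

rng = np.random.default_rng(0)
# One fused projection producing Q, K, and V in a single matmul:
# 8 + 4 + 4 = 16 head-sized output chunks.
W_qkv = rng.standard_normal((d_model, (n_q + 2 * n_kv) * d_head)) * 0.02

def fused_qkv(x: np.ndarray):
    """Split one fused projection's output into Q, K, V head groups."""
    qkv = x @ W_qkv                                        # (T, 16*64)
    q, k, v = np.split(qkv, [n_q * d_head, (n_q + n_kv) * d_head], axis=-1)
    q = q.reshape(-1, n_q, d_head)                         # (T, 8, 64)
    k = k.reshape(-1, n_kv, d_head)                        # (T, 4, 64)
    v = v.reshape(-1, n_kv, d_head)                        # (T, 4, 64)
    # Grouped-query attention: each KV head serves n_q // n_kv = 2 query heads.
    k = np.repeat(k, n_q // n_kv, axis=1)                  # (T, 8, 64)
    v = np.repeat(v, n_q // n_kv, axis=1)                  # (T, 8, 64)
    return q, k, v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-head sigmoid gating: one learned logit per head scales that head's
# output before the output projection (gate parameterization assumed).
gate_logits = rng.standard_normal(n_q)

def gated_heads(head_out: np.ndarray) -> np.ndarray:
    """head_out: (T, n_q, d_head) -> each head scaled by its sigmoid gate."""
    return head_out * sigmoid(gate_logits)[None, :, None]

x = rng.standard_normal((3, d_model))
q, k, v = fused_qkv(x)
y = gated_heads(q)  # stand-in for the attention output, for shape checking
```

Fusing the three projections into one GEMM trades three small matmuls for a single wider one, which is typically friendlier to tensor-core utilization at this width.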
Regularization
- weight decay: 0.08
- weight decay: 0.04
Test-Time Training
- score-first TTT (22 experiments)
Other
- torch.compile-based kernel fusion and hardware-aligned optimization attempts, including CUTLASS SM90, fused Triton GEMM, FP8 training, custom CUDA kernels, fused norm+residual, and stale-process mitigation
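Among the attempts above is fused norm+residual. The pattern being fused looks roughly like the following numpy sketch of the math only; in the PR this would be a chain of PyTorch ops that torch.compile can fuse into a single kernel, and the function name, eps, and shapes here are assumptions.

```python
import numpy as np

def rmsnorm_residual(x, resid, weight, eps=1e-6):
    """Residual-add followed by RMSNorm, written as separate ops.
    As PyTorch ops this chain (add, square, mean, rsqrt, mul) is exactly
    the kind of elementwise/reduction sequence torch.compile fuses."""
    h = x + resid                                             # residual add
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return h / rms * weight, h  # normalized output, updated residual stream

d = 512  # model width from the PR summary
x = np.full((2, d), 0.5)
resid = np.full((2, d), 0.5)
w = np.ones(d)
out, new_resid = rmsnorm_residual(x, resid, w)
# h is all ones, so the normalized output is ~1.0 everywhere.
```

Since torch.compile already fuses this chain automatically, a hand-written fused kernel for it has little headroom, consistent with the negative result reported here.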
Novel Contributions
- Systematic negative-results study of 30+ optimization experiments on an 8xH100 setup
- Demonstration that torch.compile (PyTorch 2.9.1) already fuses most relevant patterns
- Evidence that cuBLAS is near the hardware limit for K=512 in this setting
- Finding that quantization quality matters more than kernel engineering for this competition
- Comparison of SpinQuant/Hadamard, mixed int5/int8, Soft-Round QAT, and selective pruning
- Evaluation of architecture changes such as XSA-all, VRL, Gated Attention, larger models, batch size changes, and shard ordering
- Observation that stale nohup+torchrun processes can silently degrade performance