PR #670

open

Non-record: Negative results — hardware alignment & quantization on 8xH100

by abaybektursun
val_bpb: 1.1171
Architecture: 11-layer, d=512 Transformer
Optimizer: Parallel Muon
Artifact Size: 16 MB

Training Techniques

Optimizer
  • Parallel Muon

Quantization
  • GPTQ (scope: all)
  • SpinQuant/Hadamard (scope: all)
  • mixed int5/int8 (bits: 5, scope: per-layer)
  • STE QAT (scope: all)
  • Soft-Round QAT (scope: all)
  • selective pruning (scope: all)
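To make the STE QAT entry above concrete, here is a minimal sketch of the forward pass of straight-through-estimator quantization-aware training: weights are rounded onto a symmetric signed b-bit grid, while during training the backward pass would treat the rounding as the identity. The function name, the per-tensor symmetric scheme, and the example values are illustrative assumptions, not the PR's implementation.

```python
def fake_quant(x, bits=5, scale=None):
    """Symmetric per-tensor fake quantization onto a signed b-bit grid.

    This is the forward pass of STE QAT; in training, the backward
    pass would pass gradients straight through the rounding step.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 15 for int5
    if scale is None:
        # Per-tensor scale from the max magnitude (1.0 guards all-zero input).
        scale = max(abs(v) for v in x) / qmax or 1.0
    # Round to the grid, clamp, then map back to float.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return [qi * scale for qi in q]

# Hypothetical weight values for illustration.
w = [0.031, -0.27, 0.9, -1.0]
wq = fake_quant(w, bits=5)
```

The quantized tensor stays in float but only takes 2^bits distinct values, which is what makes the int5 artifact size achievable after export.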
Architecture
  • XSA: applied to all 11 layers instead of only the last 4 (parameters: {"layers":11})
  • VRL: Value Residual Learning to inject identity information into deep attention layers
  • Gated Attention: per-head sigmoid gating in attention
  • QKV fusion: fused 8Q/4KV grouped-query attention projection (parameters: {"q_heads":8,"kv_heads":4})
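The fused 8Q/4KV projection can be sketched as a single GEMM whose output is then split into query and shared key/value heads. This is a numpy shape sketch under the stated parameters, assuming head_dim = 512 // 8 = 64 (an assumption; the PR's actual layout may differ):

```python
import numpy as np

# Dims from the entry above; head_dim = 64 is an assumed derivation.
d, n_q, n_kv, hd = 512, 8, 4, 64

# One fused weight instead of separate Q, K, V projections:
# output width = Q (8*64) + K (4*64) + V (4*64) = 1024.
w_qkv = np.zeros((d, n_q * hd + 2 * n_kv * hd))

x = np.zeros((3, d))                 # 3 tokens of width d
qkv = x @ w_qkv                      # a single GEMM per layer
q, k, v = np.split(qkv, [n_q * hd, n_q * hd + n_kv * hd], axis=-1)

q = q.reshape(3, n_q, hd)            # (3, 8, 64)
k = k.reshape(3, n_kv, hd)           # (3, 4, 64): each KV head serves 2 Q heads
v = v.reshape(3, n_kv, hd)
```

Fusing the three projections replaces three smaller GEMMs with one larger one, which is the usual motivation on hardware where kernel-launch and small-K overheads dominate.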
Regularization
  • weight decay (weight_decay: 0.08)
  • weight decay (weight_decay: 0.04)

Test-Time Training
  • score-first TTT (experiments: 22)
Other
  • torch.compile-based kernel fusion and hardware-aligned optimization attempts, including CUTLASS SM90, fused Triton GEMM, FP8 training, custom CUDA kernels, fused norm+residual, and stale-process mitigation

Novel Contributions

  • Systematic negative-results study of 30+ optimization experiments on an 8xH100 setup
  • Demonstration that torch.compile (PyTorch 2.9.1) already fuses most relevant patterns
  • Evidence that cuBLAS is near the hardware limit for K=512 in this setting
  • Finding that quantization quality matters more than kernel engineering for this competition
  • Comparison of SpinQuant/Hadamard, mixed int5/int8, Soft-Round QAT, and selective pruning
  • Evaluation of architecture changes such as XSA-all, VRL, Gated Attention, larger models, batch size changes, and shard ordering
  • Observation that stale nohup+torchrun processes can silently degrade performance
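The claim that cuBLAS is near the hardware limit for K=512 is consistent with a back-of-the-envelope roofline check. Using approximate public H100 SXM figures (~989 BF16 TFLOP/s dense, ~3.35 TB/s HBM3 — assumptions, not measurements from this PR) and hypothetical large M, N, the arithmetic intensity of a K=512 GEMM sits above the machine's ridge point, so the kernel is compute-bound and a hand-written replacement has little headroom:

```python
# Approximate H100 SXM peaks (public spec numbers, not from the PR).
PEAK_FLOPS = 989e12   # ~BF16 dense FLOP/s
PEAK_BW = 3.35e12     # ~HBM3 bytes/s

def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of an MxK @ KxN GEMM in bf16."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

ridge = PEAK_FLOPS / PEAK_BW           # ~295 FLOP/byte ridge point
ai = gemm_intensity(8192, 8192, 512)   # ~455 FLOP/byte at K=512
compute_bound = ai > ridge             # above the ridge: compute-bound
```

Since the intensity exceeds the ridge point, the GEMM's ceiling is the tensor-core peak rather than memory bandwidth, which matches the finding that CUTLASS and Triton rewrites could not beat cuBLAS here.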