PR #1679

open

[Non-record] Megakernel Saturation Study: 5 Triton fusion variants cannot beat torch.compile at 27M scale

by ChideraIbe123
val_bpb
0.7625
Architecture
Transformer
Optimizer
Artifact Size
13.85 MB

Training Techniques

Architecture
LeakyReLU
MLP uses LeakyReLU squared activation in the fused architecture variants.
parameters: {"squared":true}
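The squared-LeakyReLU activation tagged above can be sketched in a few lines. This is a plausible reading (LeakyReLU followed by squaring); the negative slope of 0.01 is an assumption, since the PR only records `{"squared": true}`:

```python
def leaky_relu_squared(x, negative_slope=0.01):
    """Squared LeakyReLU: apply LeakyReLU, then square the result.

    Note the square discards the sign of the negative branch, so
    negative pre-activations contribute only a small value scaled
    by negative_slope**2. The slope 0.01 is an assumed default.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```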
GQA
Grouped query attention with 4 KV heads.
parameters: {"kv_heads":4}
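With grouped query attention, each of the 4 KV heads is shared by a contiguous group of query heads. A minimal index-mapping sketch (the total of 8 query heads is an assumption for illustration; the PR only records `kv_heads: 4`):

```python
def kv_head_for_query_head(q_head, num_q_heads=8, num_kv_heads=4):
    """Map a query-head index to the KV head it attends with.

    Query heads are partitioned into num_kv_heads contiguous groups;
    with 8 query heads and 4 KV heads, each KV head serves 2 query heads.
    num_q_heads=8 is an assumed value, not stated in the PR.
    """
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size
```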
BigramHash
Uses BigramHash in the architecture.
parameters: null
SmearGate
Uses SmearGate in the architecture.
parameters: null
XSA
Uses XSA-all attention modification.
parameters: null
Weight Averaging
EMA + SWA
parameters: null
Quantization
GPTQ
bits: 6
scope: model
late QAT
bits: null
scope: model
Compression
lzma
level: null
Test-Time Training
score-first TTT
parameters: null
Evaluation
sliding window eval
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Systematic 5-variant ablation of manual Triton block-level MLP fusion on top of a merged SOTA architecture
  • Negative result: all fused variants land within 0.0008 BPB of each other, and all are slightly worse than torch.compile applied to the unfused eager code
  • Audit-guided best-practices fused kernel variant with epilogue scale, fp32 inv_rms, and GROUP_SIZE_M=8 L2 swizzle
  • Direct comparison against the PR #1450 act_grad-in-forward architecture
  • Argument that torch.compile already provides near-optimal fusion at 27M scale, and that manual block-level MLP fusion is saturated
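The GROUP_SIZE_M=8 L2 swizzle in the audit-guided variant follows the standard grouped-ordering trick for tiled matmuls (as in the Triton matmul tutorial): consecutive program IDs are remapped so that tiles sharing the same rows of the left operand are launched close together and stay resident in L2. A pure-Python sketch of that remapping (variable names are illustrative, not taken from the PR's kernel):

```python
def swizzle_pid(pid, num_pid_m, num_pid_n, group_size_m=8):
    """Remap a linear program id to (pid_m, pid_n) tile coordinates.

    Tiles are visited column-major within groups of up to group_size_m
    rows, so blocks that reuse the same A-matrix rows run back to back,
    improving L2 hit rates versus a plain row-major launch order.
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may cover fewer than group_size_m rows.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % rows_in_group
    pid_n = (pid % num_pid_in_group) // rows_in_group
    return pid_m, pid_n
```

Enumerating all program IDs visits every tile of the grid exactly once, just in a cache-friendlier order.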