PR #1679
Status: open
[Non-record] Megakernel Saturation Study: 5 Triton fusion variants cannot beat torch.compile at 27M scale
by ChideraIbe123
val_bpb
0.7625
Architecture
Transformer
Optimizer
—
Artifact Size
13.85 MB
Training Techniques
Architecture
LeakyReLU
The MLP uses a squared LeakyReLU activation (LeakyReLU(x)²) in the fused architecture variants.
parameters: {"squared":true}
GQA
Grouped query attention with 4 KV heads.
parameters: {"kv_heads":4}
BigramHash
Uses BigramHash in the architecture.
parameters: null
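BigramHash is not defined in this card. Assuming it refers to a hashed bigram embedding, i.e. hashing each (previous token, current token) pair into a fixed-size table whose output is added to the token embedding, a purely illustrative sketch could be:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hypothetical hashed-bigram embedding: hash (prev_token, token) into a fixed table."""

    def __init__(self, dim: int, num_buckets: int = 65536):
        super().__init__()
        self.num_buckets = num_buckets  # bucket count is an arbitrary illustrative choice
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids; pair each token with its predecessor (position 0 pairs with itself).
        prev = torch.cat([idx[:, :1], idx[:, :-1]], dim=1)
        # Cheap multiplicative hash of the bigram into a bucket id; the constant is arbitrary.
        bucket = (prev * 1000003 + idx) % self.num_buckets
        return self.table(bucket)
```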
SmearGate
Uses SmearGate in the architecture.
parameters: null
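SmearGate is likewise undefined here. Assuming it denotes a learned gate that smears a fraction of the previous token's representation into the current position, one hypothetical sketch:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Hypothetical smear gate: mix in a learned, per-token fraction of the previous embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) token embeddings; shift right to get each position's predecessor.
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))  # (B, T, 1) smear strength
        return x + g * prev
```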
XSA
Uses XSA-all attention modification.
parameters: null
Weight Averaging
EMA + SWA
parameters: null
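A sketch of keeping both an EMA and an SWA copy of the weights, with SWA handled by torch.optim.swa_utils; the decay value and the SWA update interval are assumptions:

```python
import copy
import torch
from torch.optim.swa_utils import AveragedModel

def make_averagers(model: torch.nn.Module, ema_decay: float = 0.999):
    # SWA: equal-weight running average of checkpoints (AveragedModel's default avg_fn).
    swa_model = AveragedModel(model)
    # EMA: exponential moving average kept manually so the decay is explicit.
    ema_model = copy.deepcopy(model)
    for p in ema_model.parameters():
        p.requires_grad_(False)

    @torch.no_grad()
    def update(step: int, swa_every: int = 100):  # interval is an assumed value
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)
        if step % swa_every == 0:
            swa_model.update_parameters(model)

    return ema_model, swa_model, update
```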
Quantization
GPTQ
bits: 6
scope: model
late QAT
bits: null
scope: model
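GPTQ performs Hessian-aware error compensation, which is too involved to reproduce here. Purely to illustrate the 6-bit setting, this is a per-channel round-to-nearest quantize/dequantize pair; it is not the PR's GPTQ or late-QAT pipeline:

```python
import torch

def quantize_weight_rtn(w: torch.Tensor, bits: int = 6):
    """Per-output-channel symmetric round-to-nearest quantization of a 2D weight (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6-bit signed
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # 6-bit values stored in int8 containers

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```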
Compression
lzma
level: null
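A sketch of writing the serialized state dict through Python's lzma; the preset value is an assumption, since the card lists level: null:

```python
import io
import lzma
import torch

def save_compressed(model: torch.nn.Module, path: str, preset: int = 9):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with lzma.open(path, "wb", preset=preset) as f:  # preset assumed; not stated in the card
        f.write(buf.getvalue())

def load_compressed(model: torch.nn.Module, path: str):
    with lzma.open(path, "rb") as f:
        model.load_state_dict(torch.load(io.BytesIO(f.read())))
    return model
```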
Test-Time Training
score-first TTT
parameters: null
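Score-first TTT is not specified further in this card. Assuming it means each evaluation chunk is scored with the current weights before being used for a test-time gradient step (so no chunk is ever scored after the model has adapted on it), a sketch:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr: float = 1e-4):
    """Assumed score-first TTT loop: score each chunk, then adapt on it. lr is an assumption."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_tokens = 0.0, 0
    for x, y in chunks:                      # x, y: (B, T) input/target token ids
        model.eval()
        with torch.no_grad():                # 1) score with the weights as they are now
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
        model.train()                        # 2) then take one adaptation step on the same chunk
        opt.zero_grad()
        logits = model(x)
        F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1)).backward()
        opt.step()
    return total_loss / total_tokens         # mean NLL in nats; divide by ln(2) for bits
```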
Evaluation
sliding window eval
parameters: null
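A common form of sliding-window evaluation, which may or may not match the PR's: long sequences are scored in overlapping windows, and only targets not already covered by the previous window contribute, so every scored token keeps substantial left context. Window and stride sizes are assumptions:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll_bits(model, tokens: torch.Tensor, window: int = 1024, stride: int = 512):
    """Mean per-token NLL in bits over a long (N,) token sequence, scored in overlapping windows."""
    nll_sum, counted, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + window, tokens.numel())
        x = tokens[begin:end - 1].unsqueeze(0)
        y = tokens[begin + 1:end].unsqueeze(0)
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), reduction="none")
        n_new = (end - 1) - max(prev_end, begin)   # targets not scored by the previous window
        nll_sum += loss[-n_new:].sum().item()
        counted += n_new
        prev_end = end - 1
        if end == tokens.numel():
            break
    return nll_sum / counted / math.log(2)         # bits per token
```

Dividing this bits-per-token figure by the dataset's bytes-per-token ratio would give BPB.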
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- Systematic 5-variant ablation of manual Triton block-level MLP fusion on top of a merged SOTA architecture
- Negative result showing that all fused variants stay within 0.0008 BPB of one another and are slightly worse than the eager-code torch.compile baseline
- Audit-guided best-practices fused kernel variant with epilogue scale, fp32 inv_rms, and GROUP_SIZE_M=8 L2 swizzle (see the kernel sketch after this list)
- Direct comparison against the PR #1450 act_grad-in-forward architecture
- Argument that torch.compile already provides near-optimal fusion at 27M scale and that manual block-level MLP fusion is saturated
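The PR's kernels are not reproduced in this card. As a sketch of the three ingredients named above only (row-wise fp32 inv_rms applied in the prologue, a scalar epilogue scale, and GROUP_SIZE_M=8 grouped program ordering for L2 reuse), here is a Triton matmul kernel following the standard grouped-matmul pattern; block sizes, fp16 dtypes, and the exact fusion boundary are assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_rms_matmul_kernel(
    a_ptr, b_ptr, c_ptr, inv_rms_ptr,
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    out_scale,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    GROUP_SIZE_M: tl.constexpr,
):
    """Sketch: C = ((A * inv_rms[:, None]) @ B) * out_scale, not the PR's actual kernel."""
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    # L2 swizzle: launch programs in groups of GROUP_SIZE_M row blocks so tiles stay hot in cache.
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    # Row-wise RMSNorm scale, kept in fp32 for accuracy.
    inv_rms = tl.load(inv_rms_ptr + offs_m, mask=offs_m < M, other=0.0).to(tl.float32)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a_mask = (offs_m[:, None] < M) & (offs_k[None, :] + k * BLOCK_K < K)
        b_mask = (offs_k[:, None] + k * BLOCK_K < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptrs, mask=a_mask, other=0.0)
        # Apply inv_rms in fp32, then cast back to fp16 for the tensor-core dot.
        a = (a.to(tl.float32) * inv_rms[:, None]).to(tl.float16)
        b = tl.load(b_ptrs, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    acc *= out_scale  # epilogue scale, applied once after the K loop
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)
```

Such a kernel would be launched over a 1D grid of triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N) programs with GROUP_SIZE_M=8; the grouped launch order improves L2 reuse of A and B tiles relative to a plain row-major program order.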