PR #1192

open

Non-record: Fused Triton Megakernels — RMSNorm + LeakyReLU² (val_bpb 1.3560)

by dentity007
val_bpb: 1.3560
Architecture: Transformer

Training Techniques

Architecture
LeakyReLU
Uses squared LeakyReLU with slope 0.75 as the MLP activation.
parameters: {"slope":0.75}
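The activation itself is simple to state. Below is a minimal dependency-free sketch; the function name is illustrative, not taken from the PR:

```python
def leaky_relu_squared(x, slope=0.75):
    """LeakyReLU(slope) followed by squaring, applied elementwise.

    f(x) = LeakyReLU_0.75(x)^2. Note that squaring maps negative
    inputs to positive outputs: f(-2) = (0.75 * -2)^2 = 2.25.
    """
    y = x if x > 0 else slope * x
    return y * y
```

In practice this would be applied to tensors (e.g. via PyTorch or the fused Triton kernel described below); the scalar form above just pins down the math.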
Other
Fused Triton megakernels for RMSNorm and LeakyReLU(0.75)² are used during evaluation to speed up inference, freeing wallclock time for extra training steps within the fixed budget.
parameters: null
Evaluation
Triton eval kernels
parameters: null

Novel Contributions

  • Fused Triton kernels for RMSNorm and LeakyReLU(0.75)²
  • Evaluation-time speedup to fit more training steps within the fixed wallclock budget
  • PyTorch fallback path with MEGAKERNEL_ENABLED=0
  • autograd.Function wrappers for future training-time kernel use
  • Implementation of the requested Megakernels research direction