PR #1192

open

Non-record: Fused Triton Megakernels — RMSNorm + LeakyReLU² (val_bpb 1.3560)

by dentity007
val_bpb: 1.3560
Architecture: Transformer

Training Techniques

Architecture
LeakyReLU
Uses squared LeakyReLU with slope 0.75 as the MLP activation.
parameters: {"slope":0.75}
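The activation itself is simple to state. Below is a minimal dependency-free sketch; the function name is illustrative, not taken from the PR:

```python
def leaky_relu_squared(x, slope=0.75):
    """LeakyReLU(slope) followed by squaring, applied elementwise.

    f(x) = LeakyReLU_0.75(x)^2. Note that squaring maps negative
    inputs to positive outputs: f(-2) = (0.75 * -2)^2 = 2.25.
    """
    y = x if x > 0 else slope * x
    return y * y
```

In practice this would be applied to tensors (e.g. via PyTorch or the fused Triton kernel described below); the scalar form above just pins down the math.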
Other
Fused Triton megakernels for RMSNorm and LeakyReLU(0.75)² are used during evaluation to speed up inference, freeing wallclock time for extra training steps within the fixed budget.
parameters: null
Evaluation
Triton eval kernels
parameters: null

Novel Contributions

  • Fused Triton kernels for RMSNorm and LeakyReLU(0.75)²
  • Evaluation-time speedup to fit more training steps within the fixed wallclock budget
  • PyTorch fallback path with MEGAKERNEL_ENABLED=0
  • autograd.Function wrappers for future training-time kernel use
  • Implementation of the requested Megakernels research direction