PR #1105
openRecord: CUTLASS EVT Backward MLP Fusion + Brotli + Turbo-Muon + Memmap
by abaybektursun
val_bpb: 1.2208
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 11.51 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU(0.5) squared in the MLP and fuses the up-projection, activation, and square into a single kernel.
parameters: {"negative_slope":0.5}
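The record's actual fusion is a single GPU kernel covering the up-projection, activation, and square; the numpy sketch below shows only the activation math that kernel computes, LeakyReLU with slope 0.5 followed by a square.

```python
import numpy as np

def squared_leaky_relu(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    """act(x) = LeakyReLU(x; negative_slope) ** 2, the MLP nonlinearity from the record."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Note that the square makes the activation smooth at zero while the leaky slope keeps gradients nonzero for negative inputs.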
MLP3x
Uses a 3x MLP expansion in the model architecture.
parameters: {"multiplier":3}
XSA
Applies XSA in all transformer layers.
parameters: {"layers":11}
BigramHash
Uses BigramHash embeddings for token representation.
parameters: {"vocab_size":3072,"dimension":112}
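A hashed-bigram embedding maps each (previous token, current token) pair to one of a fixed number of buckets and looks up a learned vector for that bucket. The bucket count (3072) and dimension (112) below come from the record's parameters; the specific hash function and the BOS convention are assumptions for illustration, not the PR's actual scheme.

```python
import numpy as np

VOCAB_BUCKETS = 3072   # from the record's parameters
EMBED_DIM = 112        # from the record's parameters

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((VOCAB_BUCKETS, EMBED_DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical multiplicative hash; the PR's actual hash is not shown here.
    return (prev_tok * 1000003 + cur_tok) % VOCAB_BUCKETS

def bigram_embed(tokens: list[int]) -> np.ndarray:
    # One hashed-bigram embedding per position; first position pairs with an
    # assumed BOS id of 0.
    prev = [0] + tokens[:-1]
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return bigram_table[idx]
```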
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"partial":"16/64"}
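With the "16/64" setting, rotary embeddings are applied to only 16 of the 64 head dimensions, leaving the rest position-independent. A minimal sketch, assuming the common split-halves pairing convention and the standard 10000 frequency base (neither is confirmed by the record):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to the first `rot_dims` of the head dim.

    x: (seq_len, head_dim); with head_dim = 64 and rot_dims = 16 this matches
    the record's 16/64 partial setting.
    """
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]                  # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)  # rest untouched
```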
SmearGate
Includes SmearGate in the architecture.
parameters: null
U-Net skip connections
Uses U-Net style skip connections.
parameters: null
VE128
Uses VE128 layers in the model.
parameters: {"layers":[9,10]}
Weight Averaging
EMA
parameters: null
SWA
parameters: null
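The record lists both EMA and SWA weight averaging with no parameters. For reference, the two update rules differ only in weighting: EMA decays old weights geometrically, SWA keeps an equal-weight running mean over checkpoints. The decay value below is an illustrative assumption.

```python
import numpy as np

def ema_update(ema_w: np.ndarray, w: np.ndarray, decay: float = 0.999) -> np.ndarray:
    """One EMA step: ema <- decay * ema + (1 - decay) * w (decay is an assumption)."""
    return decay * ema_w + (1.0 - decay) * w

def swa_update(swa_w: np.ndarray, w: np.ndarray, n_averaged: int) -> np.ndarray:
    """SWA running mean: incorporate the (n_averaged + 1)-th checkpoint equally."""
    return swa_w + (w - swa_w) / (n_averaged + 1)
```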
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: 6
scope: all
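Both entries above target a 6-bit grid. GPTQ itself compensates each column's rounding error using second-order (Hessian) information; the sketch below shows only the symmetric 6-bit grid (integer levels -31..31) with per-row scales, as a baseline round-to-nearest stand-in, not the PR's quantizer.

```python
import numpy as np

def quantize_rtn_6bit(w: np.ndarray):
    """Per-row symmetric round-to-nearest quantization onto the 6-bit grid."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```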
Compression
brotli
level: 11
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"turbo_variant":true,"aol_preconditioned":true,"iterations":4,"polar_express":true,"ns_variant":"NS4"}
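Muon replaces each gradient matrix with an approximation of its orthogonal (polar) factor, computed by a few Newton-Schulz iterations. The record's variant layers AOL preconditioning and Polar Express coefficients onto a 4-iteration NS4 scheme; none of those refinements are reproduced below. This sketch uses the classic cubic iteration X <- 1.5 X - 0.5 X Xᵀ X as a stand-in to show the core idea.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, iterations: int = 4) -> np.ndarray:
    """Approximate the orthogonal factor of g via Newton-Schulz iterations.

    Frobenius normalization keeps all singular values below 1, inside the
    iteration's convergence region; each step then pushes them toward 1.
    """
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(iterations):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```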
Other
other
Memmap multi-shard data pipeline with coprime-stride sampling, daemon-thread CPU batch building, and CUDA stream double-buffered GPU prefetch.
parameters: {"shards":"multi-shard","prefetch":true}
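Coprime-stride sampling gives a cheap, seekable pseudo-shuffle over a memmapped dataset: stepping through indices by a stride coprime with the sample count visits every sample exactly once without materializing a permutation. The stride choice in the PR is not shown here; the sketch demonstrates only the index arithmetic.

```python
from math import gcd

def coprime_stride_order(n_samples: int, stride: int, start: int = 0) -> list:
    """Visit all n_samples indices exactly once via i -> (start + i*stride) mod n.

    Because gcd(stride, n_samples) == 1, the map is a bijection on
    {0, ..., n_samples - 1}, so no sample is skipped or repeated per epoch.
    """
    assert gcd(stride, n_samples) == 1, "stride must be coprime with n_samples"
    return [(start + i * stride) % n_samples for i in range(n_samples)]
```

In a memmap pipeline, each yielded index is a direct offset into the on-disk shard, so the "shuffle" costs no extra memory or I/O.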
other
CUTLASS EVT backward MLP fusion using a pingpong warp-specialized schedule and precomputed activation gradients to eliminate intermediate HBM traffic.
parameters: {"kernel":"CUTLASS EVT","schedule":"WarpSpecializedPingpong"}
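The "precomputed activation gradients" idea is that for act(x) = LeakyReLU(x)², the derivative 2·l(x)·l'(x) is nearly free to form during the forward pass, so the backward epilogue reduces to a branch-free elementwise multiply. This numpy sketch shows the math the fused CUTLASS epilogue exploits, not the kernel itself.

```python
import numpy as np

def mlp_act_forward(x: np.ndarray, negative_slope: float = 0.5):
    """Forward: y = LeakyReLU(x)^2, plus the activation gradient saved for backward."""
    l = np.where(x >= 0, x, negative_slope * x)
    y = l * l
    dact = 2.0 * l * np.where(x >= 0, 1.0, negative_slope)  # d(y)/d(x), stored
    return y, dact

def mlp_act_backward(grad_out: np.ndarray, dact: np.ndarray) -> np.ndarray:
    # No conditional logic left in backward: a single elementwise multiply.
    return grad_out * dact
```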
Novel Contributions
- Fused Triton TMA forward MLP kernel that keeps the pre-activation output off HBM
- CUTLASS EVT backward MLP fusion with pingpong schedule for faster dpre computation
- Pre-computed activation gradient stored in forward pass to remove conditional logic from backward epilogue
- Brotli-11 artifact compression replacing LZMA-9
- Turbo-Muon / AOL-preconditioned 4-iteration Newton-Schulz optimizer variant
- Memmap multi-shard data pipeline with GPU prefetch
- Reported same-machine 2xH100 improvement to 1.2208 BPB