PR #1105
openRecord: CUTLASS EVT Backward MLP Fusion + Brotli + Turbo-Muon + Memmap
by abaybektursun
val_bpb: 1.2208
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 11.51 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU(0.5) squared in the MLP and fuses the up-projection, activation, and square into a single kernel.
parameters: {"negative_slope":0.5}
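The record's actual fusion is a single GPU kernel covering the up-projection, activation, and square; the numpy sketch below shows only the activation math that kernel computes, LeakyReLU with slope 0.5 followed by a square.

```python
import numpy as np

def squared_leaky_relu(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    """act(x) = LeakyReLU(x; negative_slope) ** 2, the MLP nonlinearity from the record."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Note that the square makes the activation smooth at zero while the leaky slope keeps gradients nonzero for negative inputs.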
MLP3x
Uses a 3x MLP expansion in the model architecture.
parameters: {"multiplier":3}
XSA
Applies XSA in all transformer layers.
parameters: {"layers":11}
BigramHash
Uses BigramHash embeddings for token representation.
parameters: {"vocab_size":3072,"dimension":112}
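A hashed-bigram embedding maps each (previous token, current token) pair to one of a fixed number of buckets and looks up a learned vector for that bucket. The bucket count (3072) and dimension (112) below come from the record's parameters; the specific hash function and the BOS convention are assumptions for illustration, not the PR's actual scheme.

```python
import numpy as np

VOCAB_BUCKETS = 3072   # from the record's parameters
EMBED_DIM = 112        # from the record's parameters

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((VOCAB_BUCKETS, EMBED_DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical multiplicative hash; the PR's actual hash is not shown here.
    return (prev_tok * 1000003 + cur_tok) % VOCAB_BUCKETS

def bigram_embed(tokens: list[int]) -> np.ndarray:
    # One hashed-bigram embedding per position; first position pairs with an
    # assumed BOS id of 0.
    prev = [0] + tokens[:-1]
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return bigram_table[idx]
```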
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"partial":"16/64"}
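With the "16/64" setting, rotary embeddings are applied to only 16 of the 64 head dimensions, leaving the rest position-independent. A minimal sketch, assuming the common split-halves pairing convention and the standard 10000 frequency base (neither is confirmed by the record):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to the first `rot_dims` of the head dim.

    x: (seq_len, head_dim); with head_dim = 64 and rot_dims = 16 this matches
    the record's 16/64 partial setting.
    """
    seq, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]                  # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)  # rest untouched
```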
SmearGate
Includes SmearGate in the architecture.
parameters: null
U-Net skip connections
Uses U-Net style skip connections.
parameters: null
VE128
Uses VE128 layers in the model.
parameters: {"layers":[9,10]}
Weight Averaging
EMA
parameters: null
SWA
parameters: null
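The record lists both EMA and SWA weight averaging with no parameters. For reference, the two update rules differ only in weighting: EMA decays old weights geometrically, SWA keeps an equal-weight running mean over checkpoints. The decay value below is an illustrative assumption.

```python
import numpy as np

def ema_update(ema_w: np.ndarray, w: np.ndarray, decay: float = 0.999) -> np.ndarray:
    """One EMA step: ema <- decay * ema + (1 - decay) * w (decay is an assumption)."""
    return decay * ema_w + (1.0 - decay) * w

def swa_update(swa_w: np.ndarray, w: np.ndarray, n_averaged: int) -> np.ndarray:
    """SWA running mean: incorporate the (n_averaged + 1)-th checkpoint equally."""
    return swa_w + (w - swa_w) / (n_averaged + 1)
```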
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: 6
scope: all
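Both entries above target a 6-bit grid. GPTQ itself compensates each column's rounding error using second-order (Hessian) information; the sketch below shows only the symmetric 6-bit grid (integer levels -31..31) with per-row scales, as a baseline round-to-nearest stand-in, not the PR's quantizer.

```python
import numpy as np

def quantize_rtn_6bit(w: np.ndarray):
    """Per-row symmetric round-to-nearest quantization onto the 6-bit grid."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```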
Compression
brotli
level: 11
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"turbo_variant":true,"aol_preconditioned":true,"iterations":4,"polar_express":true,"ns_variant":"NS4"}
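Muon replaces each gradient matrix with an approximation of its orthogonal (polar) factor, computed by a few Newton-Schulz iterations. The record's variant layers AOL preconditioning and Polar Express coefficients onto a 4-iteration NS4 scheme; none of those refinements are reproduced below. This sketch uses the classic cubic iteration X <- 1.5 X - 0.5 X Xᵀ X as a stand-in to show the core idea.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, iterations: int = 4) -> np.ndarray:
    """Approximate the orthogonal factor of g via Newton-Schulz iterations.

    Frobenius normalization keeps all singular values below 1, inside the
    iteration's convergence region; each step then pushes them toward 1.
    """
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(iterations):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```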
Other
other
Memmap multi-shard data pipeline with coprime-stride sampling, daemon-thread CPU batch building, and CUDA stream double-buffered GPU prefetch.
parameters: {"shards":"multi-shard","prefetch":true}
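Coprime-stride sampling gives a cheap, seekable pseudo-shuffle over a memmapped dataset: stepping through indices by a stride coprime with the sample count visits every sample exactly once without materializing a permutation. The stride choice in the PR is not shown here; the sketch demonstrates only the index arithmetic.

```python
from math import gcd

def coprime_stride_order(n_samples: int, stride: int, start: int = 0) -> list:
    """Visit all n_samples indices exactly once via i -> (start + i*stride) mod n.

    Because gcd(stride, n_samples) == 1, the map is a bijection on
    {0, ..., n_samples - 1}, so no sample is skipped or repeated per epoch.
    """
    assert gcd(stride, n_samples) == 1, "stride must be coprime with n_samples"
    return [(start + i * stride) % n_samples for i in range(n_samples)]
```

In a memmap pipeline, each yielded index is a direct offset into the on-disk shard, so the "shuffle" costs no extra memory or I/O.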
other
CUTLASS EVT backward MLP fusion using a pingpong warp-specialized schedule and precomputed activation gradients to eliminate intermediate HBM traffic.
parameters: {"kernel":"CUTLASS EVT","schedule":"WarpSpecializedPingpong"}
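The "precomputed activation gradients" idea is that for act(x) = LeakyReLU(x)², the derivative 2·l(x)·l'(x) is nearly free to form during the forward pass, so the backward epilogue reduces to a branch-free elementwise multiply. This numpy sketch shows the math the fused CUTLASS epilogue exploits, not the kernel itself.

```python
import numpy as np

def mlp_act_forward(x: np.ndarray, negative_slope: float = 0.5):
    """Forward: y = LeakyReLU(x)^2, plus the activation gradient saved for backward."""
    l = np.where(x >= 0, x, negative_slope * x)
    y = l * l
    dact = 2.0 * l * np.where(x >= 0, 1.0, negative_slope)  # d(y)/d(x), stored
    return y, dact

def mlp_act_backward(grad_out: np.ndarray, dact: np.ndarray) -> np.ndarray:
    # No conditional logic left in backward: a single elementwise multiply.
    return grad_out * dact
```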
Novel Contributions
- Fused Triton TMA forward MLP kernel that keeps the pre-activation output off HBM
- CUTLASS EVT backward MLP fusion with pingpong schedule for faster dpre computation
- Pre-computed activation gradient stored in forward pass to remove conditional logic from backward epilogue
- Brotli-11 artifact compression replacing LZMA-9
- Turbo-Muon / AOL-preconditioned 4-iteration Newton-Schulz optimizer variant
- Memmap multi-shard data pipeline with GPU prefetch
- Reported same-machine 2xH100 improvement to 1.2208 BPB