PR #344

open

Non-record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)

by aryanbhosale
val_bpb: 1.1330
Architecture: Transformer
Optimizer: Muon
Artifact Size:

Training Techniques

Architecture
MLP3.5x
Expands the MLP hidden dimension to 3.5x the model width (hidden=1792).
parameters: {"hidden":1792}
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the MLP.
parameters: {"slope":0.5,"power":2}
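The expansion and activation above can be sketched in PyTorch. Note that hidden=1792 at a 3.5x ratio implies a model width of 512, which is inferred from the stated ratio rather than stated in the PR; the class and argument names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """MLP with 3.5x expansion (512 -> 1792) and LeakyReLU(0.5)^2 activation."""
    def __init__(self, d_model: int = 512, hidden: int = 1792):
        super().__init__()
        self.fc_in = nn.Linear(d_model, hidden, bias=False)
        self.fc_out = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LeakyReLU with negative slope 0.5, then squared elementwise
        return self.fc_out(F.leaky_relu(self.fc_in(x), negative_slope=0.5) ** 2)

mlp = MLP()
y = mlp(torch.randn(2, 8, 512))
```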
SmearGate
Adds a SmearGate mechanism.
parameters: null
BigramHash
Adds bigram hash features.
parameters: {"size":10240,"dim":128}
TrigramHash
Adds trigram hash features.
parameters: {"size":4096,"dim":128}
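A minimal sketch of hashed n-gram features, using the table sizes and dims listed above. The rolling-hash formula, the projection up to the model width, and all names are assumptions; only the sizes (bigram 10240x128, trigram 4096x128) come from the PR:

```python
import torch
import torch.nn as nn

class NgramHashEmbed(nn.Module):
    """Hash each n-token window into a small embedding table, then project
    the 128-dim feature to the model width (projection scheme is assumed)."""
    def __init__(self, n: int, table_size: int, dim: int, d_model: int = 512):
        super().__init__()
        self.n = n
        self.table_size = table_size
        self.embed = nn.Embedding(table_size, dim)
        self.proj = nn.Linear(dim, d_model, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq). Polynomial rolling hash over the last n tokens,
        # reduced mod table_size at each step (hash choice is illustrative).
        h = torch.zeros_like(ids)
        for k in range(self.n):
            shifted = torch.roll(ids, shifts=k, dims=1)
            shifted[:, :k] = 0  # pad positions before the window fills
            h = (h * 1000003 + shifted) % self.table_size
        return self.proj(self.embed(h))

bigram = NgramHashEmbed(2, 10240, 128)
trigram = NgramHashEmbed(3, 4096, 128)
ids = torch.randint(0, 50257, (2, 16))
feats = bigram(ids) + trigram(ids)
```

In practice these features would be added to the token embeddings before the first block.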
Value Residual
Caches V from layer 0 and blends it via learned lambda (ResFormer-style).
parameters: null
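The value-residual blend can be sketched as follows; the lambda initialization and the sigmoid reparameterization that keeps the blend weight in (0, 1) are assumptions, since the PR gives no parameters:

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """ResFormer-style value residual: blend this layer's V with the V
    cached from layer 0 via a learned scalar lambda."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(0.5))  # assumed init

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam)  # keep blend weight in (0, 1); assumed
        return lam * v + (1.0 - lam) * v0

blend = ValueResidual()
v0 = torch.randn(2, 8, 16, 64)   # V cached from layer 0
v = torch.randn(2, 8, 16, 64)    # current layer's V
out = blend(v, v0)
```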
Gated Attention
Per-head sigmoid gating for attention outputs.
parameters: null
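Per-head output gating can be sketched as one sigmoid gate per head computed from the layer input; where the gate is computed from, and all names here, are assumptions:

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Scale each attention head's output by a per-head sigmoid gate."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_heads, bias=False)  # one gate per head

    def forward(self, x: torch.Tensor, head_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); head_out: (B, T, n_heads, head_dim)
        g = torch.sigmoid(self.gate(x))    # (B, T, n_heads), values in (0, 1)
        return head_out * g.unsqueeze(-1)  # scale each head independently

gate = GatedAttentionOutput()
x = torch.randn(2, 16, 512)
head_out = torch.randn(2, 16, 8, 64)
gated = gate(x, head_out)
```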
XSA
Exclusive self-attention applied to all 11 layers.
parameters: {"layers":11}
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
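Partial RoPE on 16 of 64 head dimensions can be sketched as below: rotate only the first 16 dims of each head and pass the rest through. Which dims are rotated and the frequency base are assumptions:

```python
import torch

def partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to the first `rot_dims` of each head dim
    (16 of 64 per the PR); remaining dims pass through unrotated.
    x: (batch, heads, seq, head_dim)."""
    B, H, T, D = x.shape
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

q = torch.randn(2, 8, 16, 64)
q_rot = partial_rope(q)
```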
tied embeddings
Input and output embeddings are tied.
parameters: null
U-Net skip connections
Adds skip connections in a U-Net-like pattern.
parameters: null
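A sketch of the U-Net skip pattern over the layer stack: first-half outputs are saved and added back, with learned weights, to the mirrored second-half inputs. The 5 down / 1 middle / 5 up pairing for 11 layers is assumed, and `nn.Linear` stands in for a full transformer block:

```python
import torch
import torch.nn as nn

class UNetStack(nn.Module):
    """U-Net-style skips: encoder-half outputs are pushed on a stack and
    popped into the mirrored decoder-half layers with learned weights."""
    def __init__(self, n_layers: int = 11, d_model: int = 512):
        super().__init__()
        # nn.Linear is a stand-in for a transformer block
        self.blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.n_enc = n_layers // 2
        self.skip_w = nn.Parameter(torch.ones(self.n_enc))  # assumed init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i, block in enumerate(self.blocks):
            if i >= len(self.blocks) - self.n_enc:  # decoder half: pop a skip
                x = x + self.skip_w[len(self.blocks) - 1 - i] * skips.pop()
            x = block(x)
            if i < self.n_enc:                      # encoder half: push output
                skips.append(x)
        return x

model = UNetStack()
out = model(torch.randn(2, 16, 512))
```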
Initialization
OrthoInit
Orthogonal initialization.
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_schedule_end":0.99,"momentum_schedule_steps":1500,"lr":0.03}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.035,"scope":"embeddings"}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.03,"scope":"scalars"}
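The Muon momentum schedule above (0.92 to 0.99 over 1500 steps) can be sketched as a ramp followed by a hold; a linear ramp is an assumption, since the PR lists only the endpoints and step count:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  schedule_steps: int = 1500) -> float:
    """Ramp momentum linearly from `start` to `end` over `schedule_steps`
    steps, then hold at `end`."""
    frac = min(step / schedule_steps, 1.0)
    return start + frac * (end - start)
```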
Weight Averaging
EMA
parameters: {"decay":0.997}
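The EMA update with decay 0.997 follows the standard rule `shadow = decay * shadow + (1 - decay) * param`; scalars stand in here for per-tensor state, and the class name is illustrative:

```python
class EMA:
    """Exponential moving average of model weights (decay 0.997)."""
    def __init__(self, params, decay: float = 0.997):
        self.decay = decay
        self.shadow = [float(p) for p in params]  # EMA copy of the weights

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * float(p)
                       for s, p in zip(self.shadow, params)]

ema = EMA([1.0, 2.0])
ema.update([0.0, 0.0])
```

At evaluation time the shadow weights, not the live weights, would be loaded into the model.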
Quantization
int6
bits: 6
scope: per-row weights
GPTQ-lite
bits: null
scope: per-row weights
STE QAT
bits: null
scope: final 15% of training
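Uniform int6 quantization with one scale per row can be sketched as below; symmetric range and round-to-nearest are assumptions. In the STE QAT phase, the dequantized weights would replace the originals in the forward pass while gradients flow straight through to the float weights:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric uniform int6 quantization, one scale per weight row
    (levels -31..31; symmetric rounding scheme is assumed)."""
    qmax = 2 ** (6 - 1) - 1                   # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int6_per_row(w)
w_hat = dequantize(q, scale)
```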
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
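Sliding-window evaluation with stride 64 scores each token with near-full left context by advancing a fixed window in small steps and counting only the new tokens. The window-index logic can be sketched as below; the context length is an assumption, since the PR gives only the stride:

```python
def eval_windows(n_tokens: int, context_len: int = 512, stride: int = 64):
    """Return (window_start, window_end, score_start) triples: each window
    sees up to `context_len` tokens, but only the final tokens from
    `score_start` to `window_end` are scored, so no token is counted twice."""
    windows = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context_len)
        windows.append((start, end, pos))
        pos = end
    return windows

windows = eval_windows(200, context_len=128, stride=64)
```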
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
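A warmdown schedule holds the learning rate constant and then anneals it over the final 3500 steps; linear decay to zero is an assumption, as the PR lists only the warmdown length:

```python
def lr_with_warmdown(step: int, total_steps: int, base_lr: float,
                     warmdown_steps: int = 3500) -> float:
    """Constant LR, then a linear decay to zero over the last
    `warmdown_steps` steps of training."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```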
Regularization
gradient clipping
parameters: {"clip_norm":0.3}
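Global-norm gradient clipping at 0.3 maps directly onto PyTorch's built-in utility, applied after backward and before each optimizer step; the toy model and loss here are illustrative:

```python
import torch
import torch.nn as nn

# Toy model and loss, just to populate gradients
model = nn.Linear(4, 4)
loss = model(torch.ones(1, 4)).square().sum()
loss.backward()

# Rescale all gradients so their global L2 norm is at most 0.3
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
grads = torch.cat([p.grad.flatten() for p in model.parameters()])
```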

Novel Contributions

  • 11-layer Transformer with 3.5x MLP expansion and LeakyReLU(0.5)^2 activation
  • SmearGate, BigramHash, and TrigramHash feature augmentations
  • Value Residual (ResFormer-style) and Gated Attention
  • XSA applied to all 11 layers
  • Partial RoPE on 16/64 head dimensions
  • Late QAT via STE during the final 15% of training
  • Int6 uniform per-row quantization with GPTQ-lite and zstd compression