PR #344

open

Non-record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100)

by aryanbhosale
val_bpb: 1.1330
Architecture: Transformer
Optimizer: Muon
Artifact Size:

Training Techniques

Architecture
MLP3.5x
Expands the MLP hidden dimension to 3.5x the model width (hidden=1792).
parameters: {"hidden":1792}
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the MLP.
parameters: {"slope":0.5,"power":2}
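The expansion and activation above can be sketched in PyTorch. Note that hidden=1792 at a 3.5x ratio implies a model width of 512, which is inferred from the stated ratio rather than stated in the PR; the class and argument names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """MLP with 3.5x expansion (512 -> 1792) and LeakyReLU(0.5)^2 activation."""
    def __init__(self, d_model: int = 512, hidden: int = 1792):
        super().__init__()
        self.fc_in = nn.Linear(d_model, hidden, bias=False)
        self.fc_out = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LeakyReLU with negative slope 0.5, then squared elementwise
        return self.fc_out(F.leaky_relu(self.fc_in(x), negative_slope=0.5) ** 2)

mlp = MLP()
y = mlp(torch.randn(2, 8, 512))
```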
SmearGate
Adds a SmearGate mechanism.
parameters: null
BigramHash
Adds bigram hash features.
parameters: {"size":10240,"dim":128}
TrigramHash
Adds trigram hash features.
parameters: {"size":4096,"dim":128}
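A minimal sketch of hashed n-gram features, using the table sizes and dims listed above. The rolling-hash formula, the projection up to the model width, and all names are assumptions; only the sizes (bigram 10240x128, trigram 4096x128) come from the PR:

```python
import torch
import torch.nn as nn

class NgramHashEmbed(nn.Module):
    """Hash each n-token window into a small embedding table, then project
    the 128-dim feature to the model width (projection scheme is assumed)."""
    def __init__(self, n: int, table_size: int, dim: int, d_model: int = 512):
        super().__init__()
        self.n = n
        self.table_size = table_size
        self.embed = nn.Embedding(table_size, dim)
        self.proj = nn.Linear(dim, d_model, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq). Polynomial rolling hash over the last n tokens,
        # reduced mod table_size at each step (hash choice is illustrative).
        h = torch.zeros_like(ids)
        for k in range(self.n):
            shifted = torch.roll(ids, shifts=k, dims=1)
            shifted[:, :k] = 0  # pad positions before the window fills
            h = (h * 1000003 + shifted) % self.table_size
        return self.proj(self.embed(h))

bigram = NgramHashEmbed(2, 10240, 128)
trigram = NgramHashEmbed(3, 4096, 128)
ids = torch.randint(0, 50257, (2, 16))
feats = bigram(ids) + trigram(ids)
```

In practice these features would be added to the token embeddings before the first block.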
Value Residual
Caches V from layer 0 and blends it via learned lambda (ResFormer-style).
parameters: null
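The value-residual blend can be sketched as follows; the lambda initialization and the sigmoid reparameterization that keeps the blend weight in (0, 1) are assumptions, since the PR gives no parameters:

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """ResFormer-style value residual: blend this layer's V with the V
    cached from layer 0 via a learned scalar lambda."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(0.5))  # assumed init

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam)  # keep blend weight in (0, 1); assumed
        return lam * v + (1.0 - lam) * v0

blend = ValueResidual()
v0 = torch.randn(2, 8, 16, 64)   # V cached from layer 0
v = torch.randn(2, 8, 16, 64)    # current layer's V
out = blend(v, v0)
```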
Gated Attention
Per-head sigmoid gating for attention outputs.
parameters: null
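Per-head output gating can be sketched as one sigmoid gate per head computed from the layer input; where the gate is computed from, and all names here, are assumptions:

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Scale each attention head's output by a per-head sigmoid gate."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_heads, bias=False)  # one gate per head

    def forward(self, x: torch.Tensor, head_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); head_out: (B, T, n_heads, head_dim)
        g = torch.sigmoid(self.gate(x))    # (B, T, n_heads), values in (0, 1)
        return head_out * g.unsqueeze(-1)  # scale each head independently

gate = GatedAttentionOutput()
x = torch.randn(2, 16, 512)
head_out = torch.randn(2, 16, 8, 64)
gated = gate(x, head_out)
```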
XSA
Exclusive self-attention applied to all 11 layers.
parameters: {"layers":11}
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
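Partial RoPE on 16 of 64 head dimensions can be sketched as below: rotate only the first 16 dims of each head and pass the rest through. Which dims are rotated and the frequency base are assumptions:

```python
import torch

def partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to the first `rot_dims` of each head dim
    (16 of 64 per the PR); remaining dims pass through unrotated.
    x: (batch, heads, seq, head_dim)."""
    B, H, T, D = x.shape
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

q = torch.randn(2, 8, 16, 64)
q_rot = partial_rope(q)
```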
tied embeddings
Input and output embeddings are tied.
parameters: null
U-Net skip connections
Adds skip connections in a U-Net-like pattern.
parameters: null
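A sketch of the U-Net skip pattern over the layer stack: first-half outputs are saved and added back, with learned weights, to the mirrored second-half inputs. The 5 down / 1 middle / 5 up pairing for 11 layers is assumed, and `nn.Linear` stands in for a full transformer block:

```python
import torch
import torch.nn as nn

class UNetStack(nn.Module):
    """U-Net-style skips: encoder-half outputs are pushed on a stack and
    popped into the mirrored decoder-half layers with learned weights."""
    def __init__(self, n_layers: int = 11, d_model: int = 512):
        super().__init__()
        # nn.Linear is a stand-in for a transformer block
        self.blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.n_enc = n_layers // 2
        self.skip_w = nn.Parameter(torch.ones(self.n_enc))  # assumed init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i, block in enumerate(self.blocks):
            if i >= len(self.blocks) - self.n_enc:  # decoder half: pop a skip
                x = x + self.skip_w[len(self.blocks) - 1 - i] * skips.pop()
            x = block(x)
            if i < self.n_enc:                      # encoder half: push output
                skips.append(x)
        return x

model = UNetStack()
out = model(torch.randn(2, 16, 512))
```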
Initialization
OrthoInit
Orthogonal initialization.
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_schedule_end":0.99,"momentum_schedule_steps":1500,"lr":0.03}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.035,"scope":"embeddings"}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.03,"scope":"scalars"}
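The Muon momentum schedule above (0.92 to 0.99 over 1500 steps) can be sketched as a ramp followed by a hold; a linear ramp is an assumption, since the PR lists only the endpoints and step count:

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  schedule_steps: int = 1500) -> float:
    """Ramp momentum linearly from `start` to `end` over `schedule_steps`
    steps, then hold at `end`."""
    frac = min(step / schedule_steps, 1.0)
    return start + frac * (end - start)
```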
Weight Averaging
EMA
parameters: {"decay":0.997}
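The EMA update with decay 0.997 follows the standard rule `shadow = decay * shadow + (1 - decay) * param`; scalars stand in here for per-tensor state, and the class name is illustrative:

```python
class EMA:
    """Exponential moving average of model weights (decay 0.997)."""
    def __init__(self, params, decay: float = 0.997):
        self.decay = decay
        self.shadow = [float(p) for p in params]  # EMA copy of the weights

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * float(p)
                       for s, p in zip(self.shadow, params)]

ema = EMA([1.0, 2.0])
ema.update([0.0, 0.0])
```

At evaluation time the shadow weights, not the live weights, would be loaded into the model.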
Quantization
int6
bits: 6
scope: per-row weights
GPTQ-lite
bits: null
scope: per-row weights
STE QAT
bits: null
scope: final 15% of training
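Uniform int6 quantization with one scale per row can be sketched as below; symmetric range and round-to-nearest are assumptions. In the STE QAT phase, the dequantized weights would replace the originals in the forward pass while gradients flow straight through to the float weights:

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric uniform int6 quantization, one scale per weight row
    (levels -31..31; symmetric rounding scheme is assumed)."""
    qmax = 2 ** (6 - 1) - 1                   # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int6_per_row(w)
w_hat = dequantize(q, scale)
```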
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
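Sliding-window evaluation with stride 64 scores each token with near-full left context by advancing a fixed window in small steps and counting only the new tokens. The window-index logic can be sketched as below; the context length is an assumption, since the PR gives only the stride:

```python
def eval_windows(n_tokens: int, context_len: int = 512, stride: int = 64):
    """Return (window_start, window_end, score_start) triples: each window
    sees up to `context_len` tokens, but only the final tokens from
    `score_start` to `window_end` are scored, so no token is counted twice."""
    windows = []
    pos = 0  # first not-yet-scored token
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context_len)
        windows.append((start, end, pos))
        pos = end
    return windows

windows = eval_windows(200, context_len=128, stride=64)
```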
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
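A warmdown schedule holds the learning rate constant and then anneals it over the final 3500 steps; linear decay to zero is an assumption, as the PR lists only the warmdown length:

```python
def lr_with_warmdown(step: int, total_steps: int, base_lr: float,
                     warmdown_steps: int = 3500) -> float:
    """Constant LR, then a linear decay to zero over the last
    `warmdown_steps` steps of training."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```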
Regularization
gradient clipping
parameters: {"clip_norm":0.3}
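Global-norm gradient clipping at 0.3 maps directly onto PyTorch's built-in utility, applied after backward and before each optimizer step; the toy model and loss here are illustrative:

```python
import torch
import torch.nn as nn

# Toy model and loss, just to populate gradients
model = nn.Linear(4, 4)
loss = model(torch.ones(1, 4)).square().sum()
loss.backward()

# Rescale all gradients so their global L2 norm is at most 0.3
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
grads = torch.cat([p.grad.flatten() for p in model.parameters()])
```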

Novel Contributions

  • 11-layer Transformer with 3.5x MLP expansion and LeakyReLU(0.5)^2 activation
  • SmearGate, BigramHash, and TrigramHash feature augmentations
  • Value Residual (ResFormer-style) and Gated Attention
  • XSA applied to all 11 layers
  • Partial RoPE on 16/64 head dimensions
  • Late QAT via STE during the final 15% of training
  • Int6 uniform per-row quantization with GPTQ-lite and zstd compression