PR #635 (open)
Non-record: 11L MLP 3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330, 8xH100 SXM)
by aryanbhosale
val_bpb
1.1330
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Quantization
int6 uniform + GPTQ-lite
bits: 6
scope: all except tied embeddings
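A minimal sketch of symmetric per-row int6 uniform quantization (levels -31..31). The GPTQ-lite calibration step and the 5-percentile clipping are not shown, and per-row max-based scaling is an assumption:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-row uniform quantization to 6 bits (levels -31..31)."""
    qmax = 2 ** (6 - 1) - 1                   # 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)  # reconstruction error is at most scale/2 per row
```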
Architecture
MLP 3.5x with LeakyReLU(0.5)^2
Expanded MLP hidden dimension with squared LeakyReLU activation
parameters: {"expansion_factor":3.5,"activation":"LeakyReLU(0.5)^2","hidden_dim":1792}
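The block above can be sketched as follows; d_model=512 is inferred from hidden_dim 1792 / 3.5, and the weight shapes and scaling are illustrative:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.5):
    return np.where(x >= 0, x, negative_slope * x)

def mlp_block(x, w_in, w_out):
    """MLP with LeakyReLU(0.5)^2 activation: h = LeakyReLU_0.5(x @ W_in) ** 2."""
    h = leaky_relu(x @ w_in) ** 2
    return h @ w_out

d_model, hidden = 512, 1792  # 3.5x expansion, per the PR
x = np.random.randn(2, d_model)
w_in = np.random.randn(d_model, hidden) * (d_model ** -0.5)
w_out = np.random.randn(hidden, d_model) * (hidden ** -0.5)
y = mlp_block(x, w_in, w_out)
```

Squaring the activation keeps the MLP output non-negative before the down-projection, similar in spirit to the squared-ReLU used in earlier speedrun entries.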
SmearGate
Gating mechanism applied in architecture
parameters: null
BigramHash
Bigram hashing with 10240 buckets and 128 dimensions
parameters: {"buckets":10240,"dim":128}
TrigramHash
Trigram hashing with 4096 buckets and 128 dimensions
parameters: {"buckets":4096,"dim":128}
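A hedged sketch of hashed n-gram embedding lookup with the table sizes above; the hash function itself is hypothetical, as the PR does not specify it:

```python
import numpy as np

def ngram_hash_ids(tokens, n, buckets, seed=0x9E3779B1):
    """Map each position's preceding n-gram to a bucket id via a rolling hash."""
    ids = []
    for i in range(len(tokens)):
        h = seed
        for t in tokens[max(0, i - n + 1): i + 1]:
            h = (h * 1000003 ^ t) & 0xFFFFFFFF  # simple polynomial hash
        ids.append(h % buckets)
    return np.array(ids)

# Tables sized as in the PR: bigrams 10240x128, trigrams 4096x128.
bigram_table = np.random.randn(10240, 128) * 0.02
trigram_table = np.random.randn(4096, 128) * 0.02

tokens = [5, 17, 17, 99]
bi = bigram_table[ngram_hash_ids(tokens, 2, 10240)]   # (4, 128)
tri = trigram_table[ngram_hash_ids(tokens, 3, 4096)]  # (4, 128)
```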
Value Residual (ResFormer)
Caches value vectors from layer 0 and blends them into later layers via a learned lambda
parameters: null
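A minimal sketch of the value-residual blend, assuming the learned lambda is a per-layer scalar passed through a sigmoid (the PR says only "learned lambda"):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def blend_values(v_layer, v0, lam_logit):
    """ResFormer-style value residual: mix this layer's values with layer-0's."""
    lam = sigmoid(lam_logit)
    return lam * v_layer + (1.0 - lam) * v0

v0 = np.ones((2, 4))   # cached value vectors from layer 0
v5 = np.zeros((2, 4))  # current layer's value vectors
v = blend_values(v5, v0, lam_logit=0.0)  # lam = 0.5 -> even mix
```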
Gated Attention
Per-head sigmoid gating with bias initialized to 4.0
parameters: null
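A sketch of the per-head gate; with the bias initialized to 4.0, sigmoid(4.0) ≈ 0.982, so heads start almost fully open and training can learn to close individual heads. A static, input-independent gate is assumed here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_heads(attn_out, gate_bias):
    """Per-head sigmoid gate on attention output: out_h *= sigmoid(b_h)."""
    g = sigmoid(gate_bias)              # (num_heads,)
    return attn_out * g[None, :, None]  # (batch, heads, head_dim)

out = np.ones((1, 8, 64))
gated = gate_heads(out, gate_bias=np.full(8, 4.0))
```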
XSA all 11 layers
Exclusive self-attention applied on all 11 layers
parameters: {"layers":11}
Partial RoPE
Rotary positional embeddings applied partially on 16 of 64 head dimensions
parameters: {"dimensions":"16/64"}
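A sketch of partial RoPE on the 16/64 split above; rotating the first 16 dimensions (rather than some other subset) is an assumption:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply RoPE to the first rot_dims of each head dim; pass the rest through."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)  # (half,)
    angles = pos[:, None] * freqs[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)

x = np.random.randn(4, 64)  # (seq, head_dim)
y = partial_rope(x, pos=np.arange(4))
```

The remaining 48 dimensions carry no positional rotation, letting those channels stay purely content-addressed.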
Tied FP16 embeddings
Input and output embedding weights tied, stored in FP16 precision
parameters: null
U-Net skip connections
Skip connections inspired by U-Net architecture
parameters: null
Initialization
OrthoInit
Orthogonal initialization of weights
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"momentum_schedule":"0.92->0.99 over 1500 steps"}
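The momentum schedule can be sketched as a linear ramp; linearity is an assumption, since the PR states only the endpoints (0.92 -> 0.99) and the step count (1500):

```python
def muon_momentum(step: int, start=0.92, end=0.99, ramp_steps=1500) -> float:
    """Ramp Muon momentum from start to end over ramp_steps, then hold."""
    frac = min(step / ramp_steps, 1.0)
    return start + (end - start) * frac
```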
Adam
weight_decay: null
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.03}
Weight Averaging
EMA
parameters: {"decay":0.997}
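The EMA update with decay 0.997, shown as one step over a dict of parameter arrays:

```python
def ema_update(avg: dict, params: dict, decay=0.997) -> dict:
    """One EMA step: avg = decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in params}

avg = {"w": 0.0}
for _ in range(3):
    avg = ema_update(avg, {"w": 1.0})  # avg approaches 1 - 0.997**n
```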
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
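A sketch of the warmdown schedule, assuming the usual speedrun shape (constant LR, then linear decay to zero over the final 3500 steps); the function name and total_steps are illustrative:

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps=3500) -> float:
    """Constant LR, then linear warmdown to 0 over the final warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```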
Regularization
weight decay
parameters: {"weight_decay":0.04}
gradient clipping
parameters: {"clip_value":0.3}
Other
training_techniques
Late quantization-aware training (QAT) via the straight-through estimator (STE), applied during the final 15% of training
parameters: null
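A minimal sketch of late QAT via the straight-through estimator: the forward pass uses fake-quantized weights, while the gradient is applied to the full-precision master weights as if quantization were the identity. The toy quadratic loss is illustrative:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Forward: uniform symmetric quantize/dequantize of the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def qat_step(w, grad_fn, lr=0.1):
    """One QAT step: loss sees quantized weights, STE updates fp32 weights."""
    w_q = fake_quantize(w)
    g = grad_fn(w_q)   # gradient w.r.t. the quantized weights
    return w - lr * g  # STE: apply it directly to the master weights

# Toy loss 0.5 * ||w - target||^2 with gradient (w - target)
target = np.array([1.0, 0.25])
w = np.zeros(2)
for _ in range(200):
    w = qat_step(w, lambda wq: wq - target)
```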
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
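A sketch of stride-64 sliding-window evaluation: each window after the first scores only its newly exposed tokens, so every token is evaluated with near-full left context. Exact bookkeeping in the PR may differ:

```python
def sliding_windows(n_tokens: int, context=2048, stride=64):
    """Enumerate (begin, end, n_scored) windows covering every token once."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        n_scored = end - prev_end  # score only tokens not yet scored
        windows.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return windows

windows = sliding_windows(2200)
```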
Novel Contributions
- Use of MLP 3.5x expansion with LeakyReLU(0.5)^2 activation
- Integration of SmearGate gating mechanism
- Combination of BigramHash and TrigramHash embeddings
- Value Residual (ResFormer) caching and blending of layer 0 values
- Gated Attention with per-head sigmoid gating and bias initialization
- Exclusive self-attention (XSA) applied on all 11 layers
- Partial RoPE applied on a subset of head dimensions (16/64)
- Late Quantization Aware Training (QAT) via STE in final 15% of training
- Use of Muon optimizer with momentum scheduling
- Orthogonal initialization (OrthoInit) of weights
- U-Net style skip connections in Transformer architecture
- Int6 uniform quantization combined with GPTQ-lite and per-row 5-percentile clipping