PR #535
openRecord: 11L LeakyReLU² + Full GPTQ + QAT Alignment (val_bpb: 1.1204)
by raahilshah
val_bpb
1.1204
Architecture
Transformer
Optimizer
Muon (matrices) and AdamW (embeddings and scalars)
Artifact Size
15.85 MB
Training Techniques
Quantization
Full GPTQ
bits: 6
scope: all weights except small tensors and tok_emb.weight (fp16)
QAT-export alignment
bits: 6
scope: per-row clipping with quantile(0.9995) in STE and export quantizer
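A minimal sketch of what a shared quantizer with per-row quantile(0.9995) clipping could look like (function name and the epsilon guard are illustrative, not the PR's actual code); the same routine would serve as the STE fake-quantizer in training and the export quantizer, which is the alignment being claimed:

```python
import numpy as np

def rowwise_quantize(w, bits=6, q=0.9995):
    # Per-row clip threshold at the 0.9995 quantile of |w|, then
    # symmetric uniform quantization; dequantized result is returned.
    clip = np.quantile(np.abs(w), q, axis=1, keepdims=True)
    levels = 2 ** (bits - 1) - 1              # 31 levels per sign for 6 bits
    scale = np.maximum(clip, 1e-8) / levels   # guard against all-zero rows
    wc = np.clip(w, -clip, clip)
    return np.round(wc / scale) * scale
```

During QAT the forward pass would use `rowwise_quantize(w)` while the backward pass passes gradients straight through (the STE), so training sees exactly the weights the export path produces.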
Architecture
LeakyReLU(0.5)² activation
Replaces relu² in the MLP to prevent dead neurons and double the effective MLP capacity
parameters: null
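Read literally, the activation squares a LeakyReLU with negative slope 0.5 (a sketch; the PR's exact form may differ). Unlike relu², negative pre-activations still produce output and gradient, which is the stated dead-neuron fix:

```python
def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(slope) followed by squaring: a negative input x yields
    # (slope * x)^2 instead of the hard zero of relu(x)^2, so the unit
    # keeps a gradient signal on both sides of zero.
    y = x if x >= 0 else slope * x
    return y * y
```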
XSA4
Exclusive Self-Attention on the last 4 layers
parameters: {"layers":4}
Partial RoPE
Partial Rotary Positional Embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
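A sketch of partial RoPE on 16 of 64 head dimensions with the common NTK-aware base rescaling (base' = base · α^(d/(d−2))); the function name, base, and α are assumptions, not values from the PR:

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0, ntk_alpha=1.0):
    # Rotate only the first `rope_dims` channels of each head (16/64 here);
    # the remaining channels pass through position-independent.
    half = rope_dims // 2
    scaled_base = base * ntk_alpha ** (rope_dims / (rope_dims - 2))
    freqs = scaled_base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]          # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```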
LN Scale
LayerNorm scale factor 1/sqrt(layer_idx+1)
parameters: null
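The per-layer scale is simple enough to state directly; a plausible reading is that it damps each layer's contribution to the residual stream so variance stays roughly constant with depth:

```python
import math

def ln_scale(layer_idx):
    # Scale factor 1/sqrt(layer_idx + 1): layer 0 contributes at full
    # strength, deeper layers progressively less.
    return 1.0 / math.sqrt(layer_idx + 1)
```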
SmearGate
Temporal gating mechanism
parameters: null
BigramHash
Bigram hashing with 2048 buckets and 128-dim embedding
parameters: {"buckets":2048,"dimensions":128}
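A sketch of the bucket lookup (the mixing constants are illustrative assumptions; the PR's hash may differ): each (previous, current) token pair hashes into one of 2048 buckets, and each bucket indexes a learned 128-dim embedding that augments the token embedding.

```python
def bigram_bucket(prev_token, token, buckets=2048):
    # Hash the ordered bigram (prev_token, token) to a bucket id in
    # [0, buckets); the multiplier and xor-shift are illustrative.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % buckets
```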
U-Net skips
U-Net style skip connections with 5 encoder and 6 decoder skips
parameters: {"encoder_skips":5,"decoder_skips":6}
EMA
Exponential Moving Average with decay 0.997
parameters: {"decay":0.997}
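The EMA update with decay 0.997 is the standard shadow-weight rule; at this decay the effective averaging window is roughly 1/(1 − 0.997) ≈ 333 steps:

```python
def ema_update(ema_params, params, decay=0.997):
    # Shadow weights: ema <- decay * ema + (1 - decay) * current.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```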
Weight Averaging
Tight SWA
parameters: {"frequency_steps":50,"scale_threshold":0.2}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"scope":"matrices"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025,"scope":"embeddings and scalars"}
Regularization
weight decay
parameters: {"weight_decay":0.04}
gradient clipping
parameters: {"clip_value":0.3}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
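A common reading of "warmdown" with warmdown_steps=3500 is a constant learning rate followed by a linear decay to zero over the final 3500 steps (a sketch under that assumption; the PR may shape the tail differently):

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    # Multiplier on the base LR: 1.0 until the warmdown begins, then a
    # linear ramp down to 0.0 at the final step.
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```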
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
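Sliding-window evaluation with stride 64 typically advances overlapping windows by the stride and scores only each window's final tokens, so every scored token sees near-full left context. A sketch of the window placement (the window length of 1024 is an assumption, only the stride comes from the record):

```python
def window_starts(n_tokens, window=1024, stride=64):
    # Start offsets for overlapping eval windows; the tail window is
    # pinned so the last tokens of the sequence are still covered.
    starts = list(range(0, max(n_tokens - window, 0) + 1, stride))
    if starts[-1] + window < n_tokens:
        starts.append(n_tokens - window)
    return starts
```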
Test-Time Training
none
parameters: null
Initialization
Orthogonal init
Novel Contributions
- LeakyReLU(0.5)² activation replacing relu² to prevent dead neurons and double the effective MLP capacity
- Full GPTQ quantization with Hessian calibration, reducing the quantization gap by 31%
- QAT-export alignment using quantile(0.9995) clipping so the STE fake-quantizer matches the export quantizer