PR #979
Record: 1.1387 BPB — 11L LeakyReLU² + Early QAT@0.5 + GPTQ-lite + EMA
by 0xadvait
val_bpb
1.1387
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.6 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU(0.5) squared activation in the MLPs.
parameters: {"squared":true,"negative_slope":0.5}
MLP3x
Uses 3x MLP expansion.
parameters: {"expansion":3}
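A minimal NumPy sketch of the MLP block as recorded (3x expansion, LeakyReLU(0.5) followed by squaring). Residual path, norms, and biases are omitted, and the exact placement of the square is a plausible reading rather than confirmed:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(0.5), then elementwise square
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

def mlp(x, w_in, w_out):
    # x: (T, d), w_in: (d, 3d), w_out: (3d, d) -- the 3x expansion
    return leaky_relu_sq(x @ w_in) @ w_out
```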
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
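Sharing 4 KV heads across 8 query heads can be sketched as below (single sequence, no causal mask or batching; the repeat-based KV sharing is the standard GQA formulation, but shapes are illustrative):

```python
import numpy as np

def gqa(q, k, v):
    # q: (T, 8, d) query heads; k, v: (T, 4, d) shared KV heads
    group = q.shape[1] // k.shape[1]      # query heads per KV head (2)
    k = np.repeat(k, group, axis=1)       # expand KV heads to match q
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)         # softmax over keys
    return np.einsum('hqk,khd->qhd', w, v)
```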
U-Net skip connections
Adds U-Net style encoder-decoder skip connections.
parameters: {"encoder_layers":5,"decoder_layers":6}
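One plausible reading of the asymmetric 5-encoder/6-decoder scheme, with additive skips consumed in reverse order; which decoder layer goes without a skip is an assumption:

```python
def unet_forward(x, encoder_layers, decoder_layers):
    # Push each encoder output; decoder layers add them back in reverse.
    # With 5 encoder and 6 decoder layers, one decoder layer has no skip.
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)
    for layer in decoder_layers:
        if skips:
            x = x + skips.pop()
        x = layer(x)
    return x
```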
weight tying
Ties input embeddings and output embeddings.
parameters: null
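Weight tying means a single parameter matrix serves as both the input embedding table and the output projection; a minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 256, 32
W = rng.normal(scale=0.02, size=(vocab_size, d_model))  # one shared table

def embed(token_ids):
    return W[token_ids]          # input embedding lookup

def unembed(hidden):
    return hidden @ W.T          # output head reuses the same matrix
```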
RoPE
Uses rotary positional embeddings.
parameters: null
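A self-contained RoPE sketch: each adjacent pair of channels is rotated by a position-dependent angle, so token positions are encoded as rotations (base frequency 10000 is the common default, not stated in the record):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (T, d) with even d; rotate channel pairs by position-scaled angles
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    ang = np.outer(np.arange(T), inv_freq)        # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated, vector norms are preserved per position.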
Regularization
logit softcap
Caps output logits smoothly via tanh.
parameters: {"value":30}
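The tanh softcap with value 30 is near-identity for small logits and smoothly bounds large ones to ±30:

```python
import numpy as np

def softcap(logits, cap=30.0):
    # near-identity for |logits| << cap, saturates toward +/- cap
    return cap * np.tanh(logits / cap)
```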
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adamw_lr_embeddings":0.035,"adamw_lr_scalars":0.025,"momentum_warmup":"0.85->0.95"}
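The record lists a Muon momentum warmup of 0.85 -> 0.95. A linear schedule is one plausible reading; the warmup length below is an illustrative value, not from the record:

```python
def muon_momentum(step, warmup_steps=300, lo=0.85, hi=0.95):
    # linearly warm momentum from 0.85 to 0.95 over warmup_steps
    # (warmup_steps=300 is an assumed, illustrative value)
    t = min(1.0, step / warmup_steps)
    return lo + t * (hi - lo)
```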
Weight Averaging
EMA
parameters: {"decay":0.997}
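EMA with decay 0.997 maintains a shadow copy of each parameter, updated per step as `shadow = 0.997 * shadow + 0.003 * current`; a minimal dict-based sketch:

```python
def ema_update(ema_params, params, decay=0.997):
    # shadow = decay * shadow + (1 - decay) * current, per tensor
    return {k: decay * v + (1.0 - decay) * params[k]
            for k, v in ema_params.items()}
```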
Quantization
STE QAT
Quantization-aware training with straight-through-estimator rounding.
bits: 6
scope: attn/MLP weights
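A sketch of the fake-quantization step used during QAT: weights are rounded to the symmetric int6 grid in the forward pass, while in a framework the gradient would bypass the rounding (e.g. `w + (fake_quant(w) - w).detach()` in PyTorch). Per-tensor scaling is an assumption:

```python
import numpy as np

def fake_quant(w, bits=6):
    # symmetric fake quantization to the int6 grid [-31, 31]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```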
GPTQ-lite
Post-training round-to-nearest export with a per-row clip percentile search.
bits: 6
scope: attn/MLP weights
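The per-row clip percentile search (see Novel Contributions) can be sketched as picking, per weight row, the clipping percentile that minimizes reconstruction error on the int6 grid; the candidate percentiles below are illustrative:

```python
import numpy as np

def best_clip_quant(row, bits=6, percentiles=(99.0, 99.5, 99.9, 100.0)):
    # per-row search over clipping percentiles, minimizing MSE
    qmax = 2 ** (bits - 1) - 1
    best, best_err = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(row), p)
        scale = clip / qmax if clip > 0 else 1.0
        q = np.clip(np.round(row / scale), -qmax, qmax)
        err = ((q * scale - row) ** 2).sum()
        if err < best_err:
            best, best_err = q * scale, err
    return best
```

Clipping at < 100% sacrifices the outliers in a row to get a finer grid for the bulk of its weights.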
int8
bits: 8
scope: embeddings
Evaluation
sliding window eval
parameters: {"stride":64}
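Stride-64 sliding-window evaluation scores each chunk of new tokens with as much left context as the window allows; a sketch of the span layout (the window size below is illustrative, only stride=64 is from the record):

```python
def window_spans(n_tokens, window=512, stride=64):
    # (ctx_start, ctx_end, n_new): each step scores only the tokens not
    # covered by the previous window, with up to `window` of left context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```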
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
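A warmdown schedule holds the LR constant and then decays linearly to zero over the final warmdown_steps; a minimal sketch returning the LR multiplier:

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    # constant LR, then linear "warmdown" to zero over the final steps
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```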
Compression
zstd
level: 22
Novel Contributions
- Early QAT: fake quantization is enabled once the LR scale drops below 0.5, leaving ~1400 QAT steps before the end of training
- Reduced post-quantization gap from 0.28 BPB to 0.004 BPB
- 11-layer Transformer with LeakyReLU(0.5)^2 MLPs and U-Net skip connections
- GPTQ-lite per-row clip percentile search for int6 export
- Achieved 1.1387 BPB mean over 3 seeds with stride-64 sliding window evaluation