PR #478
New SOTA: 1.12676 BPB - 11L XSA-all(11) + GPTQ-lite + EMA + Late QAT
by gowtham0992
val_bpb
1.1268
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.7 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to all 11 layers instead of only the last few layers.
parameters: {"layers":11}
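The PR does not spell out the XSA mechanism. One plausible reading of "Exclusive Self Attention" (an assumption on my part) is causal attention with the diagonal masked, so each token attends only to strictly earlier tokens and never to itself:

```python
import numpy as np

def exclusive_self_attention(q, k, v):
    """Single-head attention with the diagonal masked out: each token
    attends only to strictly earlier tokens, never to itself. This
    reading of 'Exclusive Self Attention' is an assumption; the PR does
    not define the mechanism. q, k: (seq, d); v: (seq, dv).
    The first token has nothing to attend to and outputs a zero vector."""
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((seq, seq)), k=-1)   # strictly below diagonal
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * mask
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)
    return weights @ v
```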
Partial RoPE
Rotary positional embeddings applied to only a subset of the head dimensions, with NTK-aware scaling.
parameters: {"dimensions":16,"total_dimensions":64}
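A minimal sketch of partial RoPE under these parameters: only 16 of the 64 head dimensions are rotated and the rest pass through unchanged. The NTK-aware scaling exponent below is the commonly used formula, not taken from the PR, and `ntk_alpha` defaults to a no-op:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0, ntk_alpha=1.0):
    """Rotate only the first `rot_dims` of `head_dim` dimensions; the rest
    pass through untouched. ntk_alpha > 1 inflates the frequency base
    (NTK-aware scaling) to stretch the usable context; the exponent is the
    common NTK-aware formula, an assumption here.
    x: (seq, head_dim) array; positions: (seq,) ints."""
    positions = np.asarray(positions, dtype=np.float64)
    half = rot_dims // 2
    scaled_base = base * ntk_alpha ** (rot_dims / (rot_dims - 2))
    inv_freq = scaled_base ** (-2.0 * np.arange(half) / rot_dims)
    ang = positions[:, None] * inv_freq[None, :]            # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    return np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos, x[:, rot_dims:]],
        axis=-1)
```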
SmearGate
Additional gating mechanism used in the architecture.
parameters: null
BigramHash
Hash-based bigram feature module with learned embeddings.
parameters: {"buckets":2048,"dimension":128}
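A sketch of the bigram-hash idea with the PR's sizes (2048 buckets, 128-dim embeddings): each (previous, current) token pair is hashed into a bucket and a learned embedding is looked up. The hash constant and the padding value for position 0 are illustrative, not from the PR:

```python
import numpy as np

def bigram_hash_features(tokens, table, buckets=2048):
    """Map each (prev, cur) token bigram to a bucket with a cheap
    multiplicative hash, then look up the bucket's learned embedding.
    `table` is the learned (buckets, dim) matrix; the hash constant and
    the position-0 padding are illustrative choices."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))   # pad position 0
    h = (prev * 1000003 + tokens) % buckets
    return table[h]                             # (seq, dim) features
```

The resulting features would typically be added to the token embeddings at the model input; how the PR combines them is not stated.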
tied embeddings
Input and output embeddings are tied.
parameters: null
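Weight tying is the standard trick of reusing one matrix for both the input embedding lookup and the output projection, so no separate unembedding weights are stored in the artifact. A minimal sketch (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 64
emb = rng.normal(0.0, 0.02, size=(vocab, d_model))  # the single shared matrix

def embed(token_ids):
    """Input side: rows of the shared matrix."""
    return emb[np.asarray(token_ids)]

def output_logits(hidden):
    """Output side: project against the same matrix (tied)."""
    return hidden @ emb.T
```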
Quantization
GPTQ-lite
bits: 6
scope: all large weights
QAT
bits: 6
scope: all
int8
bits: 8
scope: embeddings
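The contributions list mentions an "optimal clip percentile search" for GPTQ-lite. A heavily simplified sketch of that idea: symmetric fake-quantization to 6 bits, trying a small grid of clip percentiles and keeping the one with the lowest reconstruction MSE. Real GPTQ's Hessian-based error compensation is omitted, and the percentile grid is an illustrative assumption:

```python
import numpy as np

def quantize_with_clip_search(w, bits=6,
                              percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Symmetric fake-quantization to `bits` bits, searching a grid of
    clip percentiles for the lowest-MSE reconstruction. A simplification
    of the PR's GPTQ-lite; the grid values are illustrative."""
    qmax = 2 ** (bits - 1) - 1
    best = None
    for p in percentiles:
        clip = np.percentile(np.abs(w), p)
        if clip == 0:
            continue
        scale = clip / qmax
        deq = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        mse = float(np.mean((w - deq) ** 2))
        if best is None or mse < best[0]:
            best = (mse, deq, p)
    return best[1], best[2]   # dequantized weights, chosen percentile
```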
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"start_condition":"scale<0.2"}
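The two averaging schemes above can be sketched as follows: an EMA of the weights with decay 0.997, and a uniform running average of checkpoints taken every 50 steps once the LR-schedule scale drops below 0.2. The dict-of-arrays parameter representation is an illustrative choice:

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    """One EMA step per parameter: ema <- decay*ema + (1-decay)*params."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

class SWA:
    """Uniform running average of checkpoints, collected every
    `frequency` steps once the LR scale falls below `start_scale`."""
    def __init__(self, frequency=50, start_scale=0.2):
        self.frequency, self.start_scale = frequency, start_scale
        self.avg, self.count = {}, 0

    def maybe_collect(self, step, lr_scale, params):
        if lr_scale >= self.start_scale or step % self.frequency != 0:
            return
        self.count += 1
        for k, v in params.items():
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (v - prev) / self.count  # running mean
```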
Compression
zstd
level: 22
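With the zstd CLI, levels above 19 require the `--ultra` flag; a round-trip at the PR's level 22 looks like this (file names illustrative):

```shell
# Compress the raw-binary checkpoint at zstd's maximum ratio (level 22).
zstd --ultra -22 model.bin -o model.bin.zst
# Decompression is lossless.
zstd -d model.bin.zst -o model_restored.bin
```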
Evaluation
sliding window eval
parameters: {"stride":64}
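Sliding-window evaluation advances the context window by `stride` tokens and scores only the newly exposed tokens, so every token is evaluated exactly once with near-full context. The window size and the `token_nll_fn(ctx)` interface (returning one NLL per token of `ctx`) are assumptions; the PR specifies only stride=64:

```python
def sliding_window_nll(token_nll_fn, tokens, window=256, stride=64):
    """Mean per-token NLL via a sliding window: each step scores only the
    `stride` newly exposed tokens, with the rest of the window as context.
    window=256 is illustrative; the PR only gives stride=64."""
    n, nlls = len(tokens), []
    for seg_end in range(stride, n + stride, stride):
        end = min(seg_end, n)
        ctx = tokens[max(0, end - window):end]
        per_tok = token_nll_fn(ctx)
        keep = end - (seg_end - stride)   # count of newly scored tokens
        nlls.extend(per_tok[-keep:])
    return sum(nlls) / len(nlls)
```

Dividing the resulting mean NLL (in nats) by ln 2 and by the bytes-per-token ratio gives BPB.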
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
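A sketch of orthogonal initialization via QR decomposition. The PR does not give its muP scaling factor; the 1/sqrt(fan_in) shrink applied to output projections below is an illustrative stand-in:

```python
import numpy as np

def ortho_init(shape, mup_output=False, rng=None):
    """Orthogonal init via QR (requires shape[0] >= shape[1]). When
    `mup_output` is set, the matrix is shrunk by 1/sqrt(fan_in) as a
    muP-style output-projection scaling; that exact factor is an
    assumption, not taken from the PR."""
    if rng is None:
        rng = np.random.default_rng()
    q, r = np.linalg.qr(rng.normal(size=shape))
    q = q * np.sign(np.diag(r))      # sign fix for a uniform distribution
    if mup_output:
        q = q / np.sqrt(shape[0])
    return q
```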
LR Schedule
warmdown
parameters: {"warmdown_iterations":3500}
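A warmdown schedule holds the LR multiplier at 1.0 and then decays it linearly to 0 over the final `warmdown_iterations` steps. This multiplier is the "LR scale" that the SWA start condition and the late-QAT trigger elsewhere in this card compare against. The sketch assumes no warmup phase, since the PR lists only the warmdown length:

```python
def warmdown_scale(step, total_steps, warmdown_iterations=3500):
    """LR multiplier: 1.0 until the final `warmdown_iterations` steps,
    then linear decay to 0 at `total_steps`. No warmup is modeled."""
    remaining = total_steps - step
    if remaining >= warmdown_iterations:
        return 1.0
    return max(remaining, 0) / warmdown_iterations
```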
Regularization
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer_idx+1)"}
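The scale rule multiplies each layer's LayerNorm output by 1/sqrt(layer_idx + 1), so deeper layers contribute progressively less to the residual stream. A minimal sketch (learned affine parameters omitted for brevity):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_idx + 1),
    per the PR's scale_rule. Learned gain/bias omitted for brevity."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```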
Other
other
Late QAT with int6 STE fake-quantization when LR scale drops below 0.15.
parameters: {"lr_scale_threshold":0.15}
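The late-QAT step above can be sketched as symmetric int6 fake-quantization applied only once the LR-schedule scale drops below 0.15. In training this would run under a straight-through estimator (gradients pass through the rounding unchanged); only the forward pass is shown here:

```python
import numpy as np

def fake_quant_int6(w, qmax=31):
    """Symmetric int6 fake-quantization (forward pass only). Under STE
    training, the rounding would be transparent to gradients."""
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def late_qat_step(w, lr_scale, threshold=0.15):
    """Quantize only late in training: full precision while the
    LR scale is at or above the threshold (0.15 per the PR)."""
    return fake_quant_int6(w) if lr_scale < threshold else w
```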
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"warmup_momentum":"0.92->0.99 over 1500 steps"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
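Muon's `other_params` describe a momentum warmup from 0.92 to 0.99 over the first 1500 steps. A linear ramp (the interpolation shape is assumed; the PR gives only endpoints and duration):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from 0.92 to 0.99 over the first 1500 steps,
    then hold it constant. Linear interpolation is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```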
Novel Contributions
- XSA applied to all 11 layers
- GPTQ-lite optimal clip percentile search
- EMA with tight SWA
- Late QAT int6-all triggered at low learning-rate scale
- Raw binary serialization with zstd level 22 compression
- Removal of Backout mechanism improved compression quality
- No pruning required for int6-all fitting under the size limit