PR #414

closed

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)

by signalrush
val_bpb: 1.1233
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.55 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: MLP and attention weights
QAT
bits: 6
scope: model weights
int8
bits: 8
scope: embeddings
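The GPTQ-lite entry's per-row clip search can be sketched as below; the percentile grid, symmetric int6 scheme, and MSE objective are assumptions for illustration, not the PR's exact code:

```python
import numpy as np

def quantize_rows_int6(w, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Symmetric per-row int6 quantization with a per-row clip search.

    For each row, try several clip percentiles of |w| and keep the one
    with the lowest reconstruction MSE (GPTQ-lite-style sketch).
    """
    qmax = 31  # symmetric int6 range: [-31, 31]
    out = np.empty_like(w)
    for i, row in enumerate(w):
        best_err, best_rec = np.inf, row
        for p in percentiles:
            clip = np.percentile(np.abs(row), p)
            if clip == 0:
                continue
            scale = clip / qmax
            q = np.clip(np.round(row / scale), -qmax, qmax)
            rec = q * scale
            err = np.mean((row - rec) ** 2)
            if err < best_err:
                best_err, best_rec = err, rec
        out[i] = best_rec
    return out
```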
Architecture
MLP3x
3x MLP expansion with relu-squared activation
parameters: {"expansion":3}
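A minimal forward pass for the MLP3x block, assuming bias-free projections and the listed 3x expansion:

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """3x-expanded MLP with relu-squared activation (bias-free sketch).

    x: (n, d), w_in: (d, 3d), w_out: (3d, d)
    """
    h = np.maximum(x @ w_in, 0.0) ** 2  # relu(x)^2
    return h @ w_out
```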
XSA
Efficient Partial XSA on the last 4 layers, GQA-aware and zero-alloc
parameters: {"layers":4}
Partial RoPE
Partial rotary positional embeddings with NTK-aware scaling
parameters: {"dimensions":"16/64"}
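The 16/64 split means rotary embeddings touch only the first 16 of 64 head dimensions; the rest pass through unrotated. A sketch, where the NTK base-stretching exponent is an assumption:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0, ntk_factor=1.0):
    """Apply rotary embeddings to the first `rot_dims` of each head
    dimension, leaving the remainder untouched (the 16/64 split from
    the PR). NTK-aware scaling stretches the base frequency; the exact
    exponent used here is an assumption.

    x: (seq, head_dim)
    """
    seq, head_dim = x.shape
    half = rot_dims // 2
    base = base * ntk_factor ** (rot_dims / (rot_dims - 2))
    inv_freq = 1.0 / base ** (np.arange(half) / half)
    t = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(t), np.sin(t)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```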
SmearGate
Custom gating mechanism used in the model
parameters: null
BigramHash
Bigram hashing feature with 2048 buckets and dim 128
parameters: {"buckets":2048,"dim":128}
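The bigram feature hashes each (previous token, current token) pair into one of 2048 buckets and looks up a 128-dim embedding. A sketch; the mixing constants and BOS handling are illustrative, not the PR's actual hash:

```python
import numpy as np

def bigram_features(tokens, table, buckets=2048):
    """Look up a hashed-bigram embedding per position.

    tokens: list of int token ids; table: (buckets, dim) embedding matrix.
    """
    feats = []
    prev = 0  # assumed BOS id for the first position
    for tok in tokens:
        h = ((prev * 1000003) ^ tok) * 2654435761 % (1 << 32)
        feats.append(table[h % buckets])
        prev = tok
    return np.stack(feats)
```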
tied embeddings
Input and output embeddings are tied
parameters: null
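With tied embeddings, the LM head reuses the input embedding matrix instead of a separate projection, which directly shrinks the compressed artifact:

```python
import numpy as np

def tied_logits(h, emb):
    """Tied-embedding output head: logits = h @ E^T, so the artifact
    stores a single table for input and output (sketch).

    h: (n, d) hidden states; emb: (vocab, d) embedding table.
    """
    return h @ emb.T
```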
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"warmup_momentum":"0.92->0.99 over 1500 steps"}
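The listed momentum warmup (0.92 -> 0.99 over 1500 steps) can be sketched as a schedule function; linear interpolation is an assumption:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum warmup 0.92 -> 0.99 over 1500 steps, per the Muon
    params above (the linear shape is an assumption)."""
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```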
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr":0.035,"scope":"embeddings"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr":0.025,"scope":"scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997,"every_step":true}
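The every-step EMA with decay 0.997 is a one-line update per parameter (dict-of-floats sketch):

```python
def ema_update(ema, params, decay=0.997):
    """One per-step EMA update with decay 0.997:
    ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema[k] + (1.0 - decay) * p for k, p in params.items()}
```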
SWA
parameters: {"frequency":50,"start_condition":"scale<0.2","tight":true}
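The SWA entry reads as: start a uniform running average once the LR scale drops below 0.2, updating every 50 steps. A dict-of-floats sketch; the "tight" flag is not modeled:

```python
class SWA:
    """Uniform weight average updated every `frequency` steps once the
    LR scale drops below `start_scale`, per the listed parameters."""

    def __init__(self, frequency=50, start_scale=0.2):
        self.frequency = frequency
        self.start_scale = start_scale
        self.avg = None
        self.count = 0

    def maybe_update(self, step, lr_scale, params):
        if lr_scale >= self.start_scale or step % self.frequency != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            # incremental running mean: avg += (x - avg) / count
            self.avg = {k: v + (params[k] - v) / self.count
                        for k, v in self.avg.items()}
```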
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
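Stride-64 sliding-window eval re-reads full context for every window but only scores the final stride's worth of new tokens. A sketch of the window bookkeeping; the window size of 256 is an assumption, only the stride comes from the PR:

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (window_start, score_offset) pairs: the first window scores
    all its tokens, each later window scores only its last `stride`."""
    start = 0
    while start + window <= n_tokens:
        score_from = 0 if start == 0 else window - stride
        yield start, score_from
        start += stride
```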
LR Schedule
warmdown3500
parameters: {"warmdown_steps":3500}
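The warmdown3500 schedule can be sketched as constant LR followed by a decay to zero over the final 3500 steps; the linear shape is an assumption, and base_lr is taken from the Muon entry above:

```python
def lr_at(step, total_steps, base_lr=0.025, warmdown_steps=3500):
    """Constant LR, then linear warmdown to zero over the last
    `warmdown_steps` steps of training."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```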
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
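The listed formula scales each layer's LN output by 1/sqrt(layer_idx+1), damping deeper layers' residual-stream contributions:

```python
import math

def ln_scale(layer_idx):
    """Layerwise LN scale 1/sqrt(layer_idx + 1), as listed in the PR
    (0-indexed layers assumed)."""
    return 1.0 / math.sqrt(layer_idx + 1)
```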
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections
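Orthogonal init is typically done via QR of a Gaussian matrix; a sketch, where the muP-scaled output-projection multiplier is an assumption left at 1:

```python
import numpy as np

def ortho_init(d_in, d_out, gain=1.0, seed=0):
    """Orthogonal init via reduced QR of a Gaussian matrix
    (requires d_in >= d_out). The PR's muP-scaled output projections
    would shrink `gain` with width; the exact rule is an assumption."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d_in, d_out)))
    q = q * np.sign(np.diag(r))  # sign fix for a uniform orthogonal frame
    return gain * q
```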

Novel Contributions

  • GPTQ-lite per-row optimal clip percentile search for int6 quantization
  • EMA weight averaging applied every training step before quantization
  • Longer warmdown schedule (3500 iterations) than the prior submission
  • Higher late QAT threshold (0.15) to reduce quantization gap
  • Combined post-training optimization and training hyperparameter tuning to achieve a new record
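The QAT side of the recipe amounts to fake-quantizing weights in the forward pass so training adapts to the int6 grid. A sketch; the straight-through gradient and how the 0.15 late-QAT threshold gates this are not modeled:

```python
import numpy as np

def fake_quant(w, bits=6):
    """Fake-quantize to symmetric int6 in the forward pass (QAT sketch).
    In training, gradients would pass straight through the rounding
    (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    m = np.max(np.abs(w))
    scale = m / qmax if m > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale
```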