PR #180
Record (closed): 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean of 3 seeds)
by thwu1
val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.52 MB

Training Techniques

Quantization
- mixed int5/int6 (bits: 5, scope: MLP weights)
- mixed int5/int6 (bits: 6, scope: attention weights)
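The mixed-precision scheme above can be sketched as symmetric linear quantization. This is a minimal pure-Python illustration, assuming per-tensor scales and round-to-nearest; the PR's actual packing code is not shown and may differ:

```python
def quantize(weights, bits):
    """Symmetric quantization to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax
    scale = scale or 1.0                    # avoid divide-by-zero for all-zero tensors
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.31, 0.0]
q5, s5 = quantize(w, bits=5)   # MLP weights: int5
q6, s6 = quantize(w, bits=6)   # attention weights: int6
```

Per this scheme, int5 gives the MLP weights 31 representable levels while the attention weights keep 63, matching the record's bits/scope split.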

Architecture
- BigramHash: hashes each consecutive token pair into a learned embedding table; the large bucket count keeps token-pair collisions low. parameters: {"buckets":10240,"dim":128}
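A sketch of the BigramHash lookup, using the record's bucket count (10240) and dim (128). The hash function and table initialization here are assumptions; only the bucket/dim parameters come from the record:

```python
import random

BUCKETS, DIM = 10240, 128                    # from the record's parameters
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_tok, tok):
    # Multiplicative mixing hash; the PR does not specify the actual hash.
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_embedding(prev_tok, tok):
    # At train time this vector would be learned end-to-end alongside the
    # usual token embedding; here it is just a lookup.
    return table[bigram_bucket(prev_tok, tok)]
```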
- SmearGate: gating mechanism used as part of the model architecture. parameters: null
- MLP3x: transformer MLP with a 3x expansion factor. parameters: {"hidden":1536}
- KV head count: grouped-query attention with fewer KV heads than query heads. parameters: {"heads":8,"kv_heads":4}
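With the record's counts (8 query heads, 4 KV heads), grouped-query attention shares each KV head between 8 // 4 = 2 query heads. A minimal sketch of that mapping:

```python
HEADS, KV_HEADS = 8, 4          # from the record's parameters
GROUP = HEADS // KV_HEADS       # 2 query heads share each KV head

def kv_head_for(q_head):
    """KV head index that a given query head attends against."""
    return q_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```

Halving the KV heads halves the KV-cache and the K/V projection weights, which also shrinks the quantized artifact.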
- tied embeddings: input and output embeddings are tied. parameters: null
- U-Net skip connections: skip connections added in a U-Net-like pattern. parameters: null

Optimizer
- Muon: weight_decay: 0.04, momentum: 0.99, other_params: {"matrix_lr":0.02}
- AdamW: weight_decay: 0.04, momentum: null, other_params: {"used_for":"embeddings/scalars"}
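The two optimizers split the parameter set: Muon handles the 2-D matrix parameters, AdamW the embeddings and scalar/vector parameters (per `other_params` above). A hypothetical routing rule; the parameter names and shapes are made up for illustration:

```python
# Illustrative parameter shapes; not the PR's actual model.
params = {
    "blocks.0.mlp.w_in": (512, 1536),
    "blocks.0.attn.wq":  (512, 512),
    "embed.weight":      (50304, 512),
    "blocks.0.ln.gain":  (512,),
}

def route(name, shape):
    """Matrices go to Muon; embeddings and non-matrix params go to AdamW."""
    if len(shape) == 2 and "embed" not in name:
        return "muon"
    return "adamw"

assignment = {n: route(n, s) for n, s in params.items()}
```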

Weight Averaging
- SWA: parameters: {"start_frac":0.4,"every_steps":50}
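The SWA schedule above starts averaging at 40% of training and folds in a checkpoint every 50 steps. A sketch with an incremental running mean; the total step count and the one-element weight vector are stand-ins:

```python
def swa_update(avg, weights, n_averaged):
    """Incremental running mean over checkpoints: avg <- avg + (w - avg)/(n+1)."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]

total_steps, start_frac, every_steps = 1000, 0.4, 50   # total_steps is illustrative
avg, n = None, 0
for step in range(total_steps):
    weights = [float(step)]          # stand-in for the model's flattened weights
    if step >= start_frac * total_steps and step % every_steps == 0:
        avg = list(weights) if avg is None else swa_update(avg, weights, n)
        n += 1
```

With these numbers, 12 checkpoints (steps 400 through 950) are averaged.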

Compression
- zstd (level: 22)

Evaluation
- sliding window eval: parameters: {"stride":64}
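Sliding-window evaluation re-runs the model on overlapping windows and scores only the trailing tokens of each, so every position is scored once with long left context. A sketch of the span bookkeeping; the window length of 256 is an assumption, only stride=64 comes from the record:

```python
def eval_spans(n_tokens, window=256, stride=64):
    """Yield (start, end, n_scored): each window scores only its trailing
    n_scored tokens, so every position is scored exactly once with up to
    `window` tokens of left context."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

spans = list(eval_spans(500))
```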

Initialization
- OrthoInit: orthogonal initialization with muP-scaled output projections.
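A pure-Python sketch of orthogonal initialization via Gram-Schmidt on random Gaussian rows. The muP part would scale output projections by a width-dependent gain; the `gain` argument here is an assumption, since the record does not give the scaling rule:

```python
import math, random

def orthogonal_init(n, gain=1.0, seed=0):
    """Return an n x n matrix with orthonormal rows, scaled by `gain`.
    For muP-style output projections, `gain` would shrink with width
    (an assumption; the PR's exact rule is not shown)."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        for b in basis:                      # subtract projections onto earlier rows
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([gain * x / norm for x in v])
    return basis

Q = orthogonal_init(4)
```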

LR Schedule
- warmdown: parameters: {"warmdown_iters":3000,"warmup_steps":20}
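The schedule implied by the parameters above: a short linear warmup, a flat phase, then a linear "warmdown" over the final 3000 iterations. The total iteration count is illustrative; warmup_steps=20 and warmdown_iters=3000 come from the record:

```python
def lr_mult(step, total_iters=10000, warmup_steps=20, warmdown_iters=3000):
    """LR multiplier: linear warmup, constant, then linear decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_iters - warmdown_iters:
        return (total_iters - step) / warmdown_iters
    return 1.0
```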

Regularization
- weight decay: parameters: {"value":0.04}
- magnitude pruning: parameters: {"sparsity":0.03}
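Magnitude pruning at the record's 3% sparsity zeroes the smallest-magnitude weights. A minimal sketch, assuming a global (per-tensor-flattened) threshold rather than per-layer budgets:

```python
def magnitude_prune(weights, sparsity=0.03):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.
    Ties at the cutoff may prune slightly more than requested."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

pruned = magnitude_prune([float(i) for i in range(1, 101)])  # |w| = 1..100
```

The extra zeros also make the quantized tensors more compressible under zstd.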
Novel Contributions
- Mixed int5 (MLP) / int6 (attention) quantization to reduce artifact size
- A 10th transformer layer, funded by the int5 compression savings
- Muon weight decay tuning to improve quantization friendliness
- SWA with checkpoints collected from the last 40% of training
- BigramHash with 10240 buckets to reduce token-pair collisions
- SmearGate and OrthoInit inherited from prior work