PR #162
RECORD (closed)
Record: Int6 MLP3x + SmearGate + BigramHash + MuonWD + SWA (mean val_bpb=1.1483)
by raahilshah
val_bpb
1.1458
Architecture
GPT
Optimizer
Muon
Artifact Size
15.86MB
Training Techniques
Quantization
int6
bits: 6
scope: MLP and attention weights; fp16 passthrough for tied embeddings and last-layer key projection
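A minimal sketch of per-row int6 quantization as described above: one fp16 scale per weight row, codes clipped to the signed 6-bit range. The symmetric absmax scheme and the [-31, 31] code range are assumptions; the record only specifies "per-row int6". Sensitive tensors (tied embeddings, last-layer key projection) would simply skip this path and stay in fp16.

```python
import numpy as np

def quantize_int6_per_row(w):
    """Symmetric per-row int6 quantization: one fp16 scale per row,
    codes clipped to the signed 6-bit range [-31, 31] (assumed scheme)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    """Reconstruct approximate fp32 weights from codes and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

In storage the int8 codes would be bit-packed 6 bits per value to realize the byte savings; the packing step is omitted here.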
Architecture
MLP3x
Increased the MLP hidden dimension from a 2x to a 3x expansion of the model width to add capacity.
parameters: {"hidden":1536}
SmearGate
Learned gate blending each token embedding with the previous token embedding to add lightweight bigram context.
parameters: {"params":512}
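One way to read the SmearGate description is a per-channel learned gate (512 parameters matching a 512-dim model) that convexly blends each position with its predecessor. The convex-blend form and the pass-through at position 0 are assumptions about the exact mechanism:

```python
import numpy as np

def smear_gate(x, g):
    """Per-channel gate mixing each embedding with its predecessor:
    out_t = (1 - sigmoid(g)) * x_t + sigmoid(g) * x_{t-1}.
    x: (seq, dim) embeddings; g: (dim,) learned gate logits.
    The convex-blend form is an assumption about the exact mechanism."""
    gate = 1.0 / (1.0 + np.exp(-g))     # sigmoid -> (0, 1) per channel
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                      # position 0 has no predecessor
    return (1.0 - gate) * x + gate * prev
```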
BigramHash
Hash-based bigram embedding table for adjacent token-pair context.
parameters: {"vocab_size":4096,"dim":128}
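The BigramHash lookup can be sketched as hashing each (previous, current) token pair into the 4096-entry table of 128-dim vectors; the specific multiplicative hash and the position-0 padding below are illustrative, not taken from the record:

```python
import numpy as np

def bigram_hash_embed(tokens, table, mult=1000003):
    """Hashed (previous, current) token-pair embedding per position.
    table: (4096, 128) learned vectors; the multiplicative hash is
    illustrative -- the record does not specify the hash function."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.roll(tokens, 1)
    prev[0] = 0                                   # pad the pair at position 0
    idx = (prev * mult + tokens) % table.shape[0]
    return table[idx]                             # (seq, 128)
```

The resulting vectors would be added to (or concatenated with) the token embeddings to supply adjacent-pair context.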
Initialization
OrthoInit
Orthogonal initialization for large weight matrices with muP-style output scaling.
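A minimal sketch of orthogonal initialization with a muP-style multiplier: project a Gaussian matrix onto the nearest (semi-)orthogonal matrix via SVD, then scale. The exact muP rule used here (e.g. ~1/sqrt(fan_in) on output projections) is an assumption:

```python
import numpy as np

def ortho_init(fan_out, fan_in, out_scale=1.0, seed=0):
    """Orthogonal init: SVD-project a Gaussian matrix to a semi-orthogonal
    one, then apply a muP-style multiplier (exact scaling rule assumed)."""
    a = np.random.default_rng(seed).standard_normal((fan_out, fan_in))
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return (out_scale * (u @ vt)).astype(np.float32)
```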
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"adamw_weight_decay":0.01}
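The momentum warmup parameters above suggest ramping Muon's momentum from 0.92 to the final 0.99 over the first 1500 steps; linear interpolation is an assumption about the schedule shape:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm the Muon momentum from `start` to `end` over `warmup_steps`
    optimizer steps; the linear shape is an assumption."""
    t = min(step / warmup_steps, 1.0)
    return start + (end - start) * t
```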
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every_steps":50}
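With start_frac=0.5 and every_steps=50, SWA keeps a running mean of the weights, starting halfway through training and refreshing every 50 optimizer steps. A minimal sketch (the incremental-mean formulation is a standard choice, not stated in the record):

```python
import numpy as np

class SWA:
    """Running mean of model weights, started at start_frac of training
    and refreshed every `every_steps` optimizer steps."""
    def __init__(self, total_steps, start_frac=0.5, every_steps=50):
        self.start = int(total_steps * start_frac)
        self.every = every_steps
        self.avg, self.n = None, 0

    def maybe_update(self, step, weights):
        if step < self.start or (step - self.start) % self.every:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(w, dtype=np.float64) for w in weights]
        else:
            for a, w in zip(self.avg, weights):
                a += (w - a) / self.n          # incremental running mean
```

Averaging also tends to land weights in flatter regions, which is consistent with the claim below that SWA improves quantization robustness.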
Compression
zstd
level: 22
Evaluation
stride-based eval
parameters: {"stride":64}
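Stride-based eval with stride 64 presumably means the standard sliding-window scheme: advance the window by 64 tokens and score only the newest 64, so every token is scored exactly once with long left context. A sketch of the window plan under that assumption:

```python
def stride_eval_windows(n_tokens, ctx_len, stride=64):
    """Sliding-window evaluation plan: advance by `stride`, score only
    the newest `stride` tokens of each window, and keep up to ctx_len
    tokens of left context (standard scheme; assumed here)."""
    windows, pos = [], 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - ctx_len)
        windows.append((start, end, end - pos))  # (ctx start, end, #scored)
        pos = end
    return windows
```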
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
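A warmdown schedule holds the base learning rate and then decays it over the final warmdown_iters=3000 steps; the linear decay-to-zero shape below is an assumption:

```python
def lr_warmdown(step, total_iters, base_lr, warmdown_iters=3000):
    """Hold base_lr, then decay linearly to zero over the final
    warmdown_iters steps (the linear shape is an assumption)."""
    steps_left = total_iters - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * max(steps_left, 0) / warmdown_iters
```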
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adamw_weight_decay":0.01}
Novel Contributions
- Per-row int6 quantization of MLP and attention weights with fp16 passthrough for sensitive components
- 3x MLP expansion enabled by int6 byte savings
- SmearGate for blending current and previous token embeddings
- BigramHash embedding for token-pair context
- Orthogonal initialization with muP-style scaling
- Muon optimizer with momentum warmup and weight decay
- Stochastic Weight Averaging to smooth weights and improve quantization