PR #286 (open)
Record: 10L Int5-MLP + SmearGate + BigramHash + Late QAT (val_bpb=1.1628)
by chris-buckley
val_bpb: 1.1628
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,481,841 bytes
Training Techniques
Quantization
mixed int5/int6
scope: MLP int5, attention int6
QAT
scope: final phase only
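A minimal sketch of what the mixed-precision export could look like: symmetric per-tensor fake quantization onto a signed 5- or 6-bit grid. The function name and the per-tensor scaling scheme are assumptions, not details from the PR.

```python
def fake_quantize(weights, bits):
    """Quantize-dequantize a weight list on a signed `bits`-bit grid
    (int5 for MLP weights, int6 for attention, per the summary above).
    Symmetric per-tensor scaling is an assumption."""
    qmax = 2 ** (bits - 1) - 1                # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax or 1.0
    # Round to the integer grid, clamp to the signed range, map back to float.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q]
```

At export time only the integer codes and the scale would be stored, which is what lets a 10-layer model fit the size cap.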
Architecture
SmearGate
gated residual smearing for cheap inter-token mixing
parameters: null
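No parameters are recorded for SmearGate, so one plausible reading of "gated residual smearing" is adding a gated fraction of the previous token's residual stream to the current position. The sketch below uses a fixed gate coefficient purely for illustration; the actual gate form is not stated in the PR.

```python
def smear(xs, gate=0.5):
    """Assumed SmearGate form: mix a `gate`-weighted copy of the previous
    token's hidden vector into each position (cheap inter-token mixing).
    The gate value 0.5 is illustrative only; the first token has no
    predecessor and passes through unchanged."""
    out = [list(xs[0])]
    for t in range(1, len(xs)):
        out.append([c + gate * p for c, p in zip(xs[t], xs[t - 1])])
    return out
```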
BigramHash
4096-bucket bigram embedding for token-pair context without a full bigram table
parameters: {"vocab_size":4096,"dimension":128}
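The listed parameters (4096 buckets, dimension 128) suggest a scheme along these lines: hash each (previous, current) token pair into a bucket and add that bucket's 128-dim embedding to the stream. The hash function and its multiplier below are assumptions, not taken from the PR.

```python
NUM_BUCKETS, DIM = 4096, 128              # from the listed parameters

def bigram_bucket(prev_tok, cur_tok):
    """Hash a (previous, current) token-pair into one of NUM_BUCKETS
    buckets. The odd multiplier is an arbitrary illustrative choice."""
    return (prev_tok * 1000003 + cur_tok) % NUM_BUCKETS

# Usage: index a (NUM_BUCKETS, DIM) embedding table with the bucket id
# and add the looked-up vector to the current token's embedding --
# token-pair context without a full vocab^2 bigram table.
```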
Initialization
Orthogonal init
orthogonal initialization with muP-style output projection scaling for stable deep training
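A rough illustration of the init, assuming "muP-style output projection scaling" means dividing the output projection by the layer width (one common muP choice). The Gram-Schmidt routine stands in for a proper library orthogonal initializer.

```python
import random

def orthonormal_rows(n, m, seed=0):
    """n orthonormal rows of length m (n <= m) via Gram-Schmidt; a
    stand-in for a library orthogonal initializer."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(m)]
        for u in rows:                        # remove components along earlier rows
            d = sum(a * b for a, b in zip(v, u))
            v = [a - d * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows

def init_out_proj(n_out, width, seed=0):
    """Orthogonal init with the output projection scaled by 1/width --
    the muP-style scaling assumed here for stable deep training."""
    return [[a / width for a in row]
            for row in orthonormal_rows(n_out, width, seed)]
```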
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"decoupled_weight_decay":true}
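`decoupled_weight_decay: true` means the decay acts directly on the weights, AdamW-style, rather than being folded into the gradient. A sketch, with Muon's orthogonalized momentum step abstracted into `update`:

```python
def step_with_decoupled_wd(w, update, lr, wd=0.04):
    """Apply an optimizer step with decoupled weight decay: weights
    shrink by lr * wd independently of the update direction. `update`
    stands in for Muon's orthogonalized momentum step (not shown)."""
    return [(1.0 - lr * wd) * wi - lr * ui for wi, ui in zip(w, update)]
```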
Weight Averaging
SWA
parameters: {"start_frac":0.5,"every_steps":50,"num_checkpoints":15}
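The listed SWA parameters map onto a simple running average of checkpoints: start at 50% of training, snapshot every 50 steps, stop after 15 snapshots. A minimal sketch (weights flattened to a list for brevity):

```python
class SWA:
    """Running checkpoint average matching the listed parameters."""
    def __init__(self, total_steps, start_frac=0.5, every_steps=50,
                 num_checkpoints=15):
        self.start = int(total_steps * start_frac)
        self.every = every_steps
        self.cap = num_checkpoints
        self.count = 0
        self.avg = None

    def maybe_update(self, step, weights):
        """Fold `weights` into the running mean when the schedule says so."""
        if step < self.start or step % self.every != 0 or self.count >= self.cap:
            return
        if self.avg is None:
            self.avg = list(weights)
        else:                                 # incremental mean update
            self.avg = [a + (w - a) / (self.count + 1)
                        for a, w in zip(self.avg, weights)]
        self.count += 1
```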
Evaluation
sliding window eval
parameters: {"stride":64,"full_tail_handling":true}
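With stride 64, each eval window typically scores only its newest tokens while the rest serves as context; the span layout alone might look like the sketch below, where the final pinned window is an assumed reading of `full_tail_handling`.

```python
def window_spans(n_tokens, window=2048, stride=64):
    """Sliding-window eval layout: windows advance by `stride`, and a
    final window is pinned to the sequence end so the tail is fully
    covered (the assumed meaning of full_tail_handling)."""
    spans, start = [], 0
    while start + window < n_tokens:
        spans.append((start, start + window))
        start += stride
    spans.append((max(0, n_tokens - window), n_tokens))
    return spans
```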
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.01}
Other
other
late QAT beginning at 85% of training wallclock, avoiding the instability of always-on STE while closing most of the quantization gap
parameters: {"start_frac":0.85}
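The schedule itself is a one-line gate. The PR measures the threshold in wallclock fraction; the sketch below uses steps as a stand-in for simplicity.

```python
def qat_active(step, total_steps, start_frac=0.85):
    """Late QAT: enable fake quantization (and its straight-through
    estimator) only for the final (1 - start_frac) of training, rather
    than running STE from step zero."""
    return step >= int(total_steps * start_frac)
```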
Novel Contributions
- Mixed-precision int5 MLP / int6 attention export to fit a 10-layer model under the 16 MB cap
- SmearGate for cheap inter-token mixing without learned parameters
- BigramHash 4096-bucket bigram embedding for token-pair context
- Late QAT starting at 85% wallclock instead of always-on STE
- Orthogonal initialization with muP-style output projection scaling
- Decoupled Muon weight decay and SWA during warmdown
- Sliding-window evaluation with stride 64 and full-tail handling