PR #219 (open) by alertcat
Non-record: 12L Int5-MLP + Int6-Attn mixed quantization, val_bpb=1.1541
val_bpb: 1.1541
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB
Training Techniques
Quantization
mixed int5/int6: MLP weights int5, attention weights int6, tied embeddings fp16
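A minimal sketch of the precision tiering described above, assuming per-tensor symmetric uniform quantization (the PR states only the bit widths, not the rounding or scaling scheme, so per-channel scales or stochastic rounding may differ):

```python
# Minimal per-tensor symmetric quantization sketch (assumed scheme; the PR
# only specifies int5 for MLP weights and int6 for attention weights).
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Tiering per the PR: int5 for MLP weights, int6 for attention weights.
mlp_q, mlp_scale = quantize([0.3, -0.7, 0.05], bits=5)
attn_q, attn_scale = quantize([0.3, -0.7, 0.05], bits=6)
```

The extra bit for attention roughly halves its rounding error relative to the MLP weights, which is where the "tiered" savings come from.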
Architecture
SmearGate: learned token blending gate
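The PR gives no internals for SmearGate, so the following is a hypothetical sketch of a token-blending gate: each position mixes in a learned fraction of the previous token's representation, with a single scalar gate assumed for illustration.

```python
import math

# Hypothetical "smear" gate: blend each token with its predecessor by a
# learned gate (single scalar here; the PR does not specify the form).
def smear(xs, gate_logit):
    g = 1.0 / (1.0 + math.exp(-gate_logit))   # sigmoid of the learned gate
    out = [xs[0]]                             # first token has no predecessor
    for prev, cur in zip(xs, xs[1:]):
        out.append([(1 - g) * c + g * p for p, c in zip(prev, cur)])
    return out
```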
BigramHash: bigram hashing feature module (buckets: 2048, dimension: 128)
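A sketch of what a bigram-hash feature can look like: each (previous, current) token pair is hashed into one of 2048 buckets that index a learned 128-dim table. The bucket count and dimension come from the PR; the hash function itself is an assumption.

```python
# Sketch of a bigram-hash feature lookup. Bucket count and dimension match
# the PR's parameters; the multiplicative hash is illustrative only.
BUCKETS, DIM = 2048, 128

def bigram_bucket(prev_id, cur_id):
    return (prev_id * 1000003 + cur_id) % BUCKETS   # assumed mixing function

tokens = [5, 17, 42]
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
```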
MLP3x: MLP with 3x expansion and relu-squared activation (hidden: 1536)
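The 3x-expansion MLP with relu-squared activation can be sketched as below; hidden=1536 matches 3x of the model's dim=512, but the tiny shapes in the usage example are illustrative placeholders.

```python
# Sketch of the MLP3x block: up-project dim -> 3*dim, apply relu^2, then
# down-project back to dim. Weight matrices are column lists for clarity.
def relu_sq(x):
    return [max(0.0, v) ** 2 for v in x]

def mlp3x(x, w_up, w_down):
    h = relu_sq([sum(xi * w for xi, w in zip(x, col)) for col in w_up])
    return [sum(hi * w for hi, w in zip(h, col)) for col in w_down]
```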
Tied embeddings: input and output embeddings are tied (vocab: 1024)
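Weight tying means one vocab x dim table both embeds input tokens and, applied transposed, produces output logits (vocab=1024 per the PR; the 2-token table below is illustrative):

```python
# Sketch of tied embeddings: the same table is used for input lookup and,
# row-wise as dot products, for the output projection.
def embed(table, token_id):
    return table[token_id]

def tied_logits(table, hidden):
    # logits[v] = <hidden, table[v]>: embedding rows double as output weights
    return [sum(h * w for h, w in zip(hidden, row)) for row in table]
```

Tying halves the embedding storage, which matters under a 16MB artifact budget.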
KV head count: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4)
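With 8 query heads sharing 4 KV heads (per the PR), consecutive pairs of query heads read the same KV head; the grouping can be sketched as:

```python
# Grouped-query attention head mapping: each KV head serves
# heads // kv_heads = 2 query heads.
HEADS, KV_HEADS = 8, 4

def kv_head_for(query_head):
    return query_head // (HEADS // KV_HEADS)

mapping = [kv_head_for(h) for h in range(HEADS)]
```

Halving the KV heads halves the K/V projection parameters, another source of budget savings.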
U-Net skip connections: skip connections across layers in a U-Net-like pattern
Initialization
OrthoInit: orthogonal initialization with muP scaling
Optimizer
Muon: weight_decay 0.04, momentum 0.99 (AdamW weight_decay: 0.04)
Weight Averaging
SWA: average of the last 7 checkpoints, snapshotted every 200 steps
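The averaging step itself is simple: keep the last 7 snapshots taken every 200 steps and average them elementwise (counts per the PR; how snapshotting interacts with the warmdown phase is assumed):

```python
# Sketch of SWA checkpoint averaging: elementwise mean over stored snapshots.
def swa_average(checkpoints):
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]
```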
Evaluation
Sliding window eval (stride: 64)
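Sliding-window evaluation advances the context window by `stride` tokens and scores only the tokens not covered by the previous window, so most tokens are predicted with near-full context. Stride=64 is from the PR; the windowing details below are an assumption:

```python
# Sketch of strided sliding-window evaluation spans.
def eval_spans(n_tokens, window=2048, stride=64):
    """Return (begin, end, n_scored) spans; each token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```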
Sequence Length
sequence_length: train 2048, eval 2048
LR Schedule
Warmdown (warmdown_iters: 3000)
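A warmdown schedule holds the learning rate constant, then decays it to zero over the final iterations. The 3000-iteration warmdown is from the PR; the linear decay shape and the total step count in the example are assumptions:

```python
# Sketch of a warmdown LR multiplier: 1.0 until the warmdown window, then
# linear decay to 0 over the last warmdown_iters steps (shape assumed).
def lr_scale(step, total_steps, warmdown_iters=3000):
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```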
Regularization
Weight decay (muon_wd: 0.04, adam_wd: 0.04)
Other
12 transformer layers, model dimension 512, 8 heads, 4 KV heads, 29.2M parameters total
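A back-of-envelope check of the reported parameter count, under assumed shapes (up/down MLP matrices, GQA projections with head_dim 64, tied embeddings, and the BigramHash table); the small gap to the reported 29.2M would plausibly be covered by SmearGate, norms, and other minor parameters:

```python
# Rough parameter-count reconstruction from the card's hyperparameters.
dim, hidden, layers = 512, 1536, 12
heads, kv_heads, head_dim = 8, 4, 64
attn = dim * heads * head_dim * 2 + dim * kv_heads * head_dim * 2  # q,o + k,v
mlp = dim * hidden * 2                                             # up + down
emb = 1024 * dim                                                   # tied table
bigram = 2048 * 128
total = layers * (attn + mlp) + emb + bigram                       # ~29.1M
```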
Novel Contributions
- Mixed precision-tiered quantization using int5 for MLP weights and int6 for attention weights
- Using int5 compression savings to fund a 12th transformer layer within the 16MB budget
- SmearGate learned token blending
- BigramHash feature module
- SWA checkpoint averaging during warmdown
- U-Net skip connections with orthogonal and muP-scaled initialization