PR #219 (open) by alertcat
Non-record: 12L Int5-MLP + Int6-Attn mixed quantization, val_bpb=1.1541
val_bpb: 1.1541
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB
Training Techniques
Quantization
mixed int5/int6: MLP weights int5, attention weights int6, tied embeddings fp16
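A minimal sketch of the precision tiering described above, assuming per-tensor symmetric uniform quantization (the PR states only the bit widths, not the rounding or scaling scheme, so per-channel scales or stochastic rounding may differ):

```python
# Minimal per-tensor symmetric quantization sketch (assumed scheme; the PR
# only specifies int5 for MLP weights and int6 for attention weights).
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Tiering per the PR: int5 for MLP weights, int6 for attention weights.
mlp_q, mlp_scale = quantize([0.3, -0.7, 0.05], bits=5)
attn_q, attn_scale = quantize([0.3, -0.7, 0.05], bits=6)
```

The extra bit for attention roughly halves its rounding error relative to the MLP weights, which is where the "tiered" savings come from.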
Architecture
SmearGate: learned token blending gate
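The PR gives no internals for SmearGate, so the following is a hypothetical sketch of a token-blending gate: each position mixes in a learned fraction of the previous token's representation, with a single scalar gate assumed for illustration.

```python
import math

# Hypothetical "smear" gate: blend each token with its predecessor by a
# learned gate (single scalar here; the PR does not specify the form).
def smear(xs, gate_logit):
    g = 1.0 / (1.0 + math.exp(-gate_logit))   # sigmoid of the learned gate
    out = [xs[0]]                             # first token has no predecessor
    for prev, cur in zip(xs, xs[1:]):
        out.append([(1 - g) * c + g * p for p, c in zip(prev, cur)])
    return out
```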
BigramHash: bigram hashing feature module (buckets: 2048, dimension: 128)
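A sketch of what a bigram-hash feature can look like: each (previous, current) token pair is hashed into one of 2048 buckets that index a learned 128-dim table. The bucket count and dimension come from the PR; the hash function itself is an assumption.

```python
# Sketch of a bigram-hash feature lookup. Bucket count and dimension match
# the PR's parameters; the multiplicative hash is illustrative only.
BUCKETS, DIM = 2048, 128

def bigram_bucket(prev_id, cur_id):
    return (prev_id * 1000003 + cur_id) % BUCKETS   # assumed mixing function

tokens = [5, 17, 42]
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
```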
MLP3x: MLP with 3x expansion and relu-squared activation (hidden: 1536)
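The 3x-expansion MLP with relu-squared activation can be sketched as below; hidden=1536 matches 3x of the model's dim=512, but the tiny shapes in the usage example are illustrative placeholders.

```python
# Sketch of the MLP3x block: up-project dim -> 3*dim, apply relu^2, then
# down-project back to dim. Weight matrices are column lists for clarity.
def relu_sq(x):
    return [max(0.0, v) ** 2 for v in x]

def mlp3x(x, w_up, w_down):
    h = relu_sq([sum(xi * w for xi, w in zip(x, col)) for col in w_up])
    return [sum(hi * w for hi, w in zip(h, col)) for col in w_down]
```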
Tied embeddings: input and output embeddings are tied (vocab: 1024)
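Weight tying means one vocab x dim table both embeds input tokens and, applied transposed, produces output logits (vocab=1024 per the PR; the 2-token table below is illustrative):

```python
# Sketch of tied embeddings: the same table is used for input lookup and,
# row-wise as dot products, for the output projection.
def embed(table, token_id):
    return table[token_id]

def tied_logits(table, hidden):
    # logits[v] = <hidden, table[v]>: embedding rows double as output weights
    return [sum(h * w for h, w in zip(hidden, row)) for row in table]
```

Tying halves the embedding storage, which matters under a 16MB artifact budget.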
KV head count: grouped-query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4)
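With 8 query heads sharing 4 KV heads (per the PR), consecutive pairs of query heads read the same KV head; the grouping can be sketched as:

```python
# Grouped-query attention head mapping: each KV head serves
# heads // kv_heads = 2 query heads.
HEADS, KV_HEADS = 8, 4

def kv_head_for(query_head):
    return query_head // (HEADS // KV_HEADS)

mapping = [kv_head_for(h) for h in range(HEADS)]
```

Halving the KV heads halves the K/V projection parameters, another source of budget savings.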
U-Net skip connections: skip connections across layers in a U-Net-like pattern
Initialization
OrthoInit: orthogonal initialization with muP scaling
Optimizer
Muon: weight_decay 0.04, momentum 0.99 (AdamW weight_decay: 0.04)
Weight Averaging
SWA: average of the last 7 checkpoints, snapshotted every 200 steps
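The averaging step itself is simple: keep the last 7 snapshots taken every 200 steps and average them elementwise (counts per the PR; how snapshotting interacts with the warmdown phase is assumed):

```python
# Sketch of SWA checkpoint averaging: elementwise mean over stored snapshots.
def swa_average(checkpoints):
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]
```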
Evaluation
Sliding window eval (stride: 64)
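Sliding-window evaluation advances the context window by `stride` tokens and scores only the tokens not covered by the previous window, so most tokens are predicted with near-full context. Stride=64 is from the PR; the windowing details below are an assumption:

```python
# Sketch of strided sliding-window evaluation spans.
def eval_spans(n_tokens, window=2048, stride=64):
    """Return (begin, end, n_scored) spans; each token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```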
Sequence Length
sequence_length: train 2048, eval 2048
LR Schedule
Warmdown (warmdown_iters: 3000)
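A warmdown schedule holds the learning rate constant, then decays it to zero over the final iterations. The 3000-iteration warmdown is from the PR; the linear decay shape and the total step count in the example are assumptions:

```python
# Sketch of a warmdown LR multiplier: 1.0 until the warmdown window, then
# linear decay to 0 over the last warmdown_iters steps (shape assumed).
def lr_scale(step, total_steps, warmdown_iters=3000):
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```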
Regularization
Weight decay (muon_wd: 0.04, adam_wd: 0.04)
Other
12 transformer layers, model dimension 512, 8 heads, 4 KV heads, 29.2M parameters total
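A back-of-envelope check of the reported parameter count, under assumed shapes (up/down MLP matrices, GQA projections with head_dim 64, tied embeddings, and the BigramHash table); the small gap to the reported 29.2M would plausibly be covered by SmearGate, norms, and other minor parameters:

```python
# Rough parameter-count reconstruction from the card's hyperparameters.
dim, hidden, layers = 512, 1536, 12
heads, kv_heads, head_dim = 8, 4, 64
attn = dim * heads * head_dim * 2 + dim * kv_heads * head_dim * 2  # q,o + k,v
mlp = dim * hidden * 2                                             # up + down
emb = 1024 * dim                                                   # tied table
bigram = 2048 * 128
total = layers * (attn + mlp) + emb + bigram                       # ~29.1M
```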
Novel Contributions
- Mixed precision-tiered quantization using int5 for MLP weights and int6 for attention weights
- Using int5 compression savings to fund a 12th transformer layer within the 16MB budget
- SmearGate learned token blending
- BigramHash feature module
- SWA checkpoint averaging during warmdown
- U-Net skip connections with orthogonal and muP-scaled initialization