PR #180
Record (closed): 10L Int5-MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 (val_bpb=1.1428, mean of 3 seeds)
by thwu1
val_bpb: 1.1428
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.52 MB

Training Techniques

Quantization
- mixed int5/int6 (bits: 5, scope: MLP weights)
- mixed int5/int6 (bits: 6, scope: attention weights)
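The mixed-precision scheme above can be sketched as symmetric linear quantization. This is a minimal pure-Python illustration, assuming per-tensor scales and round-to-nearest; the PR's actual packing code is not shown and may differ:

```python
def quantize(weights, bits):
    """Symmetric quantization to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = max(abs(w) for w in weights) / qmax
    scale = scale or 1.0                    # avoid divide-by-zero for all-zero tensors
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.31, 0.0]
q5, s5 = quantize(w, bits=5)   # MLP weights: int5
q6, s6 = quantize(w, bits=6)   # attention weights: int6
```

Per this scheme, int5 gives the MLP weights 31 representable levels while the attention weights keep 63, matching the record's bits/scope split.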

Architecture
- BigramHash: hashes each consecutive token pair into a learned embedding table; the large bucket count keeps token-pair collisions low. parameters: {"buckets":10240,"dim":128}
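A sketch of the BigramHash lookup, using the record's bucket count (10240) and dim (128). The hash function and table initialization here are assumptions; only the bucket/dim parameters come from the record:

```python
import random

BUCKETS, DIM = 10240, 128                    # from the record's parameters
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def bigram_bucket(prev_tok, tok):
    # Multiplicative mixing hash; the PR does not specify the actual hash.
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_embedding(prev_tok, tok):
    # At train time this vector would be learned end-to-end alongside the
    # usual token embedding; here it is just a lookup.
    return table[bigram_bucket(prev_tok, tok)]
```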
- SmearGate: gating mechanism used as part of the model architecture. parameters: null
- MLP3x: transformer MLP with a 3x expansion factor. parameters: {"hidden":1536}
- KV head count: grouped-query attention with fewer KV heads than query heads. parameters: {"heads":8,"kv_heads":4}
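With the record's counts (8 query heads, 4 KV heads), grouped-query attention shares each KV head between 8 // 4 = 2 query heads. A minimal sketch of that mapping:

```python
HEADS, KV_HEADS = 8, 4          # from the record's parameters
GROUP = HEADS // KV_HEADS       # 2 query heads share each KV head

def kv_head_for(q_head):
    """KV head index that a given query head attends against."""
    return q_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```

Halving the KV heads halves the KV-cache and the K/V projection weights, which also shrinks the quantized artifact.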
- tied embeddings: input and output embeddings are tied. parameters: null
- U-Net skip connections: skip connections added in a U-Net-like pattern. parameters: null

Optimizer
- Muon: weight_decay: 0.04, momentum: 0.99, other_params: {"matrix_lr":0.02}
- AdamW: weight_decay: 0.04, momentum: null, other_params: {"used_for":"embeddings/scalars"}
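The two optimizers split the parameter set: Muon handles the 2-D matrix parameters, AdamW the embeddings and scalar/vector parameters (per `other_params` above). A hypothetical routing rule; the parameter names and shapes are made up for illustration:

```python
# Illustrative parameter shapes; not the PR's actual model.
params = {
    "blocks.0.mlp.w_in": (512, 1536),
    "blocks.0.attn.wq":  (512, 512),
    "embed.weight":      (50304, 512),
    "blocks.0.ln.gain":  (512,),
}

def route(name, shape):
    """Matrices go to Muon; embeddings and non-matrix params go to AdamW."""
    if len(shape) == 2 and "embed" not in name:
        return "muon"
    return "adamw"

assignment = {n: route(n, s) for n, s in params.items()}
```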

Weight Averaging
- SWA: parameters: {"start_frac":0.4,"every_steps":50}
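The SWA schedule above starts averaging at 40% of training and folds in a checkpoint every 50 steps. A sketch with an incremental running mean; the total step count and the one-element weight vector are stand-ins:

```python
def swa_update(avg, weights, n_averaged):
    """Incremental running mean over checkpoints: avg <- avg + (w - avg)/(n+1)."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]

total_steps, start_frac, every_steps = 1000, 0.4, 50   # total_steps is illustrative
avg, n = None, 0
for step in range(total_steps):
    weights = [float(step)]          # stand-in for the model's flattened weights
    if step >= start_frac * total_steps and step % every_steps == 0:
        avg = list(weights) if avg is None else swa_update(avg, weights, n)
        n += 1
```

With these numbers, 12 checkpoints (steps 400 through 950) are averaged.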

Compression
- zstd (level: 22)

Evaluation
- sliding window eval: parameters: {"stride":64}
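Sliding-window evaluation re-runs the model on overlapping windows and scores only the trailing tokens of each, so every position is scored once with long left context. A sketch of the span bookkeeping; the window length of 256 is an assumption, only stride=64 comes from the record:

```python
def eval_spans(n_tokens, window=256, stride=64):
    """Yield (start, end, n_scored): each window scores only its trailing
    n_scored tokens, so every position is scored exactly once with up to
    `window` tokens of left context."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

spans = list(eval_spans(500))
```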

Initialization
- OrthoInit: orthogonal initialization with muP-scaled output projections.
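A pure-Python sketch of orthogonal initialization via Gram-Schmidt on random Gaussian rows. The muP part would scale output projections by a width-dependent gain; the `gain` argument here is an assumption, since the record does not give the scaling rule:

```python
import math, random

def orthogonal_init(n, gain=1.0, seed=0):
    """Return an n x n matrix with orthonormal rows, scaled by `gain`.
    For muP-style output projections, `gain` would shrink with width
    (an assumption; the PR's exact rule is not shown)."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        for b in basis:                      # subtract projections onto earlier rows
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([gain * x / norm for x in v])
    return basis

Q = orthogonal_init(4)
```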

LR Schedule
- warmdown: parameters: {"warmdown_iters":3000,"warmup_steps":20}
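The schedule implied by the parameters above: a short linear warmup, a flat phase, then a linear "warmdown" over the final 3000 iterations. The total iteration count is illustrative; warmup_steps=20 and warmdown_iters=3000 come from the record:

```python
def lr_mult(step, total_iters=10000, warmup_steps=20, warmdown_iters=3000):
    """LR multiplier: linear warmup, constant, then linear decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_iters - warmdown_iters:
        return (total_iters - step) / warmdown_iters
    return 1.0
```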

Regularization
- weight decay: parameters: {"value":0.04}
- magnitude pruning: parameters: {"sparsity":0.03}
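Magnitude pruning at the record's 3% sparsity zeroes the smallest-magnitude weights. A minimal sketch, assuming a global (per-tensor-flattened) threshold rather than per-layer budgets:

```python
def magnitude_prune(weights, sparsity=0.03):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.
    Ties at the cutoff may prune slightly more than requested."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= cutoff else w for w in weights]

pruned = magnitude_prune([float(i) for i in range(1, 101)])  # |w| = 1..100
```

The extra zeros also make the quantized tensors more compressible under zstd.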
Novel Contributions
- Mixed int5 (MLP) / int6 (attention) quantization to reduce artifact size
- A 10th transformer layer, funded by the int5 compression savings
- Muon weight decay tuning to improve quantization friendliness
- SWA with checkpoints collected from the last 40% of training
- BigramHash with 10240 buckets to reduce token-pair collisions
- SmearGate and OrthoInit inherited from prior work