PR #563
Add submission: 10L Enhanced with BigramHash(12288) + SOTA techniques
by instax-dutta
val_bpb
1.1428
Architecture
Transformer
Optimizer
Muon
Artifact Size
within 16MB limit
Training Techniques
Quantization
mixed int5/int6
bits: 5 and 6 (mixed)
scope: int5 for MLP weights, int6 for attention weights, fp16 for tied embeddings and last-layer key projections
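The PR does not include the quantization code; below is a minimal sketch of symmetric per-tensor quantization at the two stated widths. The per-tensor scales and signed ranges are assumptions, and bit-packing for storage is omitted.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization of a weight matrix to signed `bits`-wide ints."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Scope as stated in the submission: int5 for MLP weights, int6 for attention weights.
q_mlp,  s_mlp  = quantize_symmetric(torch.randn(1536, 512), bits=5)
q_attn, s_attn = quantize_symmetric(torch.randn(512, 512),  bits=6)
```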
Architecture
BigramHash
Hash consecutive token pairs into a larger embedding table to reduce collisions
parameters: {"vocab_size":12288,"embedding_dim":128,"projection_dim":512}
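A sketch of how such a hashed bigram embedding could look with the stated sizes; the hash function, boundary handling, and projection placement are assumptions, and the PR's actual code may differ.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Embed (previous, current) token pairs by hashing into a fixed-size table."""
    def __init__(self, vocab_size=12288, embedding_dim=128, projection_dim=512):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.proj = nn.Linear(embedding_dim, projection_dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64 ids; pair each token with its predecessor
        # (first position pairs with itself as a simple boundary choice).
        prev = torch.cat([tokens[:, :1], tokens[:, :-1]], dim=1)
        # Multiplicative hash of the pair; collisions shrink as the table grows,
        # which is the motivation for raising it from 10240 to 12288.
        idx = (prev * 1000003 + tokens) % self.vocab_size
        return self.proj(self.embed(idx))             # (batch, seq, projection_dim)
```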
SmearGate
Token-level recurrence for lightweight bigram signal
parameters: null
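SmearGate lists no parameters; a plausible minimal form is a learned gate that blends each position with the previous one. The additive mixing and the scalar sigmoid gate below are assumptions.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """One-step gated recurrence: 'smear' each token with its predecessor."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right by one position, first token unchanged.
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))               # per-token scalar in (0, 1)
        return x + g * prev                           # cheap bigram-like signal
```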
MLP3x
MLP with 3× expansion and ReLU² activation
parameters: {"expansion_factor":3,"hidden_dim":1536,"activation":"ReLU²"}
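The MLP block follows directly from the stated parameters (512 → 1536 → 512); bias-free projections are an assumption.

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Feed-forward block with 3x expansion and ReLU² activation."""
    def __init__(self, dim: int = 512, expansion: int = 3):
        super().__init__()
        self.fc_in = nn.Linear(dim, expansion * dim, bias=False)    # 512 -> 1536
        self.fc_out = nn.Linear(expansion * dim, dim, bias=False)   # 1536 -> 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(torch.relu(self.fc_in(x)).square())      # ReLU²
```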
U-Net skip connections
Skip connections in U-Net style
parameters: null
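For a 10-layer stack this usually means the first half's activations are re-injected into the mirrored second half; the learned per-skip mixing weights below are an assumption.

```python
import torch
import torch.nn as nn

class UNetSkipStack(nn.Module):
    """U-Net style skips over a transformer: layer i's output feeds layer L-1-i."""
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        assert len(layers) % 2 == 0
        self.layers, self.half = layers, len(layers) // 2
        self.skip_w = nn.Parameter(torch.ones(self.half))  # learned mixing weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        saved = []
        for layer in self.layers[: self.half]:                # "encoder" half
            x = layer(x)
            saved.append(x)
        for i, layer in enumerate(self.layers[self.half :]):  # "decoder" half
            x = x + self.skip_w[i] * saved.pop()              # long-range skip
            x = layer(x)
        return x
```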
tied embeddings
Weight tying between input and output embeddings
parameters: null
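Weight tying is a one-liner in PyTorch: a single vocab × dim matrix serves as both the input embedding and the output head, halving that portion of the artifact. The vocab size below is a placeholder.

```python
import torch.nn as nn

vocab, dim = 50304, 512                      # vocab size is a placeholder
embed = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab, bias=False)
lm_head.weight = embed.weight                # one shared matrix for both roles
```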
KV head count
4 KV heads with GQA
parameters: {"kv_heads":4,"attention_heads":8,"model_dim":512,"layers":10}
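With 8 query heads over 4 KV heads, each K/V pair is shared by two query heads, halving the KV projections (and KV cache). A sketch under those stated sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    """Grouped-query attention: 8 query heads share 4 KV heads (groups of 2)."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=4):
        super().__init__()
        self.nh, self.nkv, self.hd = n_heads, n_kv_heads, dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.nh,  self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.nkv, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.nkv, self.hd).transpose(1, 2)
        rep = self.nh // self.nkv                     # query heads per KV head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))
```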
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500,"adamw_weight_decay":0.04,"adamw_scope":"embeddings/scalars","matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
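Muon applies momentum SGD to 2D weights and orthogonalizes each update with a Newton-Schulz iteration; per other_params, embeddings and scalars fall back to AdamW, and momentum warms up from 0.92 to 0.99 over the first 1,500 steps. A sketch of the core update with the submission's lr/momentum/decay; the plain (non-Nesterov) momentum form and decoupled decay are assumptions.

```python
import torch

@torch.no_grad()
def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix (the core of Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315            # standard quintic coefficients
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(p, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update for a 2D weight, using the submission's settings."""
    buf.mul_(momentum).add_(grad)                # momentum buffer
    p.mul_(1 - lr * weight_decay)                # decoupled weight decay
    p.add_(newton_schulz(buf), alpha=-lr)        # orthogonalized step
```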
Weight Averaging
SWA
parameters: {"start_frac":0.4,"average_every":50}
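A sketch of the averaging schedule implied by the parameters: begin 40% of the way through training and fold the live weights into a running mean every 50 steps. The model and training loop below are stand-ins.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def swa_update(avg_model, model, n_averaged: int):
    """Running mean of weights: w_avg += (w - w_avg) / (n_averaged + 1)."""
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.add_(p.detach() - p_avg, alpha=1.0 / (n_averaged + 1))

model = nn.Linear(512, 512)             # stand-in for the real network
total_steps = 10_000                    # stand-in for the real schedule
swa_start = int(0.4 * total_steps)      # start_frac = 0.4
avg_model, n_avg = None, 0
for step in range(total_steps):
    # ... one optimizer step on `model` would go here ...
    if step >= swa_start and step % 50 == 0:        # average_every = 50
        if avg_model is None:
            avg_model = copy.deepcopy(model)
        else:
            swa_update(avg_model, model, n_avg)
        n_avg += 1
```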
Compression
zstd
level: 22
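Level 22 is zstd's maximum. One way to apply it to the serialized checkpoint with the `zstandard` package (file paths are placeholders):

```python
import zstandard as zstd

def compress_artifact(src: str, dst: str, level: int = 22) -> int:
    """zstd-compress the serialized checkpoint; returns the on-disk size in bytes."""
    with open(src, "rb") as f:
        blob = zstd.ZstdCompressor(level=level).compress(f.read())
    with open(dst, "wb") as f:
        f.write(blob)
    return len(blob)                     # compare against the 16 MB artifact limit
```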
Evaluation
sliding window eval
parameters: {"stride":64}
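Sliding-window evaluation re-scores the document in overlapping windows so each token is predicted with near-full left context; with stride 64, only the last 64 targets of each window are newly counted. A sketch, assuming the model returns `(1, T, vocab)` logits:

```python
import torch

@torch.no_grad()
def sliding_window_nll(model, tokens, window: int = 2048, stride: int = 64):
    """Mean per-token NLL (nats) using stride-64 overlapping windows."""
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.size(0), stride):
        end = min(begin + window, tokens.size(0))
        ids = tokens[begin:end].unsqueeze(0)               # (1, T)
        logp = torch.log_softmax(model(ids)[0, :-1], dim=-1)
        tgt = ids[0, 1:]
        token_nll = -logp.gather(1, tgt.unsqueeze(1)).squeeze(1)
        new = token_nll[-(end - prev_end):]                # only unscored targets
        nll_sum += new.sum().item()
        n_scored += new.numel()
        prev_end = end
        if end == tokens.size(0):
            break
    return nll_sum / n_scored   # nats/token; bpb = this / ln(2) * tokens_per_byte
```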
Regularization
weight decay
parameters: {"weight_decay_value":0.04}
magnitude pruning
parameters: {"prune_percent":3}
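Pruning 3% of weights by magnitude mainly buys compression: the extra zeros make the quantized tensors more compressible under zstd. A per-tensor sketch (whether the PR thresholds per tensor or globally is not stated):

```python
import torch

@torch.no_grad()
def magnitude_prune(w: torch.Tensor, prune_percent: float = 3.0) -> torch.Tensor:
    """Return a copy of w with the smallest-magnitude prune_percent of entries zeroed."""
    k = int(w.numel() * prune_percent / 100.0)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return w * (w.abs() > threshold)                   # keep strictly larger entries
```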
Initialization
OrthoInit
Orthogonal initialization with muP scaling for output projections
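A minimal reading of that description, assuming a muP-style gain that shrinks with fan-in; the exact multiplier used in the PR is not stated.

```python
import torch.nn as nn

def ortho_init_(linear: nn.Linear) -> None:
    """Orthogonal init with an assumed muP-style 1/sqrt(fan_in) gain."""
    nn.init.orthogonal_(linear.weight, gain=linear.in_features ** -0.5)

out_proj = nn.Linear(512, 512, bias=False)   # e.g. an attention output projection
ortho_init_(out_proj)
```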
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Mixed int5/int6 quantization, with int5 for MLP weights and int6 for attention weights, to reduce artifact size
- Increased BigramHash vocabulary size from 10240 to 12288 to reduce hash collisions and improve bigram signal
- Use of SmearGate for token-level recurrence to enhance bigram signal
- Orthogonal initialization with muP scaling for output projections
- Muon optimizer with 0.04 weight decay and 0.99 momentum, warmed up from 0.92 over the first 1,500 steps
- SWA starting 40% of the way into training (start_frac 0.4) with averaging every 50 steps
- Sliding window evaluation with stride 64 so each token is scored with near-full left context
- Combination of zstd-22 compression with 3% magnitude pruning to fit within 16MB artifact size limit