val_bpb: 1.1478
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,957,281 bytes (≈15.2 MiB)
Training Techniques

Quantization
- STE QAT (bits: 5, scope: MLP)
- STE QAT (bits: 6, scope: attention and bigram-proj)
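The card ships no code, but the per-row STE fake quantization it describes can be sketched in a few lines. The function name, the symmetric rounding scheme, and the zero-row guard are assumptions; only the bit widths and per-row granularity come from the card.

```python
import numpy as np

def fake_quant_per_row(w, bits):
    # Symmetric per-row fake quantization: each weight row gets its own
    # scale, matching an export format that stores one scale per row.
    qmax = 2 ** (bits - 1) - 1                   # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized forward-pass weights
```

Under a straight-through estimator the backward pass treats the round-and-clip as identity; in an autograd framework this is typically written as `w + (fake_quant(w) - w).detach()` so gradients reach the full-precision master weights.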
Architecture
- SmearGate: learned previous-token blending at the embedding layer
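The card gives no formula for SmearGate. One plausible reading of "learned previous-token blending" is a per-channel sigmoid gate mixing each embedding with its predecessor; the parameterization below is an illustrative assumption, not the card's definition.

```python
import numpy as np

def smear_gate(x, g):
    # x: (seq, dim) token embeddings; g: (dim,) learned gate logits.
    gate = 1.0 / (1.0 + np.exp(-g))    # per-channel sigmoid gate
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # position 0 has no previous token
    return x + gate * prev             # blend in the previous token's embedding
```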
- BigramHash: hash-based bigram embedding table (dimensions: 128, table_size: 10240)
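A sketch of how a hash-based bigram table of this size might be indexed. Only dimensions=128 and table_size=10240 come from the card; the hash function and the BOS padding at position 0 are assumptions.

```python
import numpy as np

TABLE_SIZE, DIM = 10240, 128   # sizes from the card

def bigram_hash_embed(tokens, table):
    # Map each (previous, current) token pair to a row of a fixed-size
    # table; hash collisions are shared rows the model learns around.
    toks = np.asarray(tokens)
    prev = np.concatenate(([0], toks[:-1]))        # assumed BOS id 0 at position 0
    idx = (prev * 1000003 + toks) % TABLE_SIZE     # illustrative hash, not the card's
    return table[idx]                              # (seq, DIM)
```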
- MLP3x: MLP with 3x expansion (hidden_size: 1536)
- GQA: grouped-query attention with 8 attention heads and 4 KV heads (layers: 10, dim: 512)
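With 8 attention heads sharing 4 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A minimal numpy sketch of that sharing (single layer, causal, names illustrative):

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (8, seq, hd) query heads; k, v: (4, seq, hd) shared KV heads.
    group = q.shape[0] // k.shape[0]             # query heads per KV head (2)
    k = np.repeat(k, group, axis=0)              # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores[:, mask] = -1e9                       # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v
```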
- Tied embeddings: input and output embeddings share one weight matrix (vocab_size: 1024)
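Weight tying means a single vocab_size × dim matrix plays both roles, saving a second 1024 × 512 matrix, which matters for a ~16 MB artifact. A minimal sketch (names and init scale are illustrative):

```python
import numpy as np

VOCAB, DIM = 1024, 512
rng = np.random.default_rng(0)
embed = rng.standard_normal((VOCAB, DIM)) * 0.02   # the single shared matrix

def embed_tokens(tokens):
    return embed[np.asarray(tokens)]   # input side: row lookup

def lm_logits(h):
    return h @ embed.T                 # output side: same weights, transposed
```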
Optimizer
- Muon: weight_decay 0.04, momentum 0.99, matrix_lr 0.02, warmup_momentum_start 0.92, warmup_steps 1500
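The warmup_momentum_start and warmup_steps entries imply a momentum schedule for Muon. A linear ramp from 0.92 to 0.99 over 1500 steps is assumed below; the card does not state the ramp shape.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Ramp momentum from warmup_momentum_start to its final value over
    # warmup_steps; the linear shape is an assumption.
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```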
- AdamW: weight_decay 0.04, scalar_lr 0.02, tied_embed_lr 0.03
Weight Averaging
- SWA (start_frac: 0.4, every_steps: 50, checkpoints_averaged: 24)
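With start_frac 0.4 and every_steps 50, checkpoints from the last 60% of training enter the average every 50 steps. A sketch of the gating condition and the incremental mean (helper names are illustrative):

```python
def swa_ready(step, total_steps, start_frac=0.4, every_steps=50):
    # A checkpoint joins the average once past start_frac of training,
    # sampled every every_steps optimizer steps.
    return step >= start_frac * total_steps and step % every_steps == 0

def swa_update(avg, new, n_seen):
    # Incremental mean: after n_seen averaged checkpoints, fold in one more.
    return [a + (p - a) / (n_seen + 1) for a, p in zip(avg, new)]
```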
Compression
- zstd (level: 22)
Evaluation
- Sliding-window evaluation (stride: 64)
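Sliding-window evaluation with stride 64 re-scores only the newest 64 tokens of each 2048-token window, so nearly every token is predicted with close-to-full context. A sketch of the window schedule (helper name and tuple layout are illustrative):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Returns (start, end, n_scored) spans. The first window scores all of
    # its tokens; each later window slides forward by `stride` and scores
    # only its newly revealed final `stride` tokens.
    first_end = min(window, n_tokens)
    spans = [(0, first_end, first_end)]
    end = first_end
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return spans
```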
Sequence Length
- train_length: 2048, eval_length: 2048
LR Schedule
- Warmdown (warmdown_iters: 3000)
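A "warmdown" schedule holds the learning rate flat and then decays it to zero over the final 3000 iterations. The linear decay shape below is the usual convention but an assumption here:

```python
def lr_multiplier(step, total_steps, warmdown_iters=3000):
    # Flat LR for most of training, then linear decay ("warmdown") to zero
    # over the final warmdown_iters steps.
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)
```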
Initialization
- OrthoInit: orthogonal weight initialization with muP output-projection scaling
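Orthogonal initialization draws weights from the orthogonal group via a QR decomposition. The muP-style shrinking of output projections is shown with a 1/sqrt(2 · n_layers) gain, which is a common convention and an assumption here; the card does not state the exact factor.

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    # Orthogonal init via QR of a Gaussian matrix (assumes rows >= cols);
    # the sign fix makes the sample uniform over the orthogonal group.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    return gain * q * np.sign(np.diag(r))

def ortho_out_proj(rows, cols, n_layers=10, seed=0):
    # muP-style output-projection scaling; the 1/sqrt(2 * n_layers) gain is
    # a common convention and an assumption, not stated by the card.
    return orthogonal_init(rows, cols, gain=1.0 / np.sqrt(2 * n_layers), seed=seed)
```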
Regularization
- Weight decay (muon_wd: 0.04, adamw_wd: 0.04)
Novel Contributions
- Mixed-precision QAT with int5 STE for MLP and int6 STE for attention/bigram projection
- STE quantization aligned exactly with the export-time per-row quantization scheme
- QAT enabled from step 0 on the full SOTA stack
- Combination of QAT with the existing SOTA architecture features such as SmearGate and BigramHash