val_bpb: 1.3286
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15 MB
Training Techniques
Architecture
- GQA: grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads":8,"kv_heads":4}).
- Weight tying: input embeddings tied with the output head embeddings.
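A minimal NumPy sketch of the grouped-query attention configuration above (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads). This is an illustrative single-pass computation, not the run's actual implementation; masking and positional encoding are omitted.

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: n_heads query heads share n_kv_heads KV heads.

    q: (B, T, n_heads * hd); k, v: (B, T, n_kv_heads * hd)
    """
    B, T, D = q.shape
    hd = D // n_heads
    group = n_heads // n_kv_heads            # query heads per KV head (2 here)
    qh = q.reshape(B, T, n_heads, hd).transpose(0, 2, 1, 3)     # (B, H, T, hd)
    kh = k.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)  # (B, Hkv, T, hd)
    vh = v.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)
    # broadcast each KV head across its group of query heads
    kh = np.repeat(kh, group, axis=1)                           # (B, H, T, hd)
    vh = np.repeat(vh, group, axis=1)
    att = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(hd)           # (B, H, T, T)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)                      # softmax
    out = att @ vh
    return out.transpose(0, 2, 1, 3).reshape(B, T, D)
```

The memory saving comes from the KV projections: with 4 KV heads instead of 8, the K/V weights and KV cache are halved.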
Quantization
- mixed int6/int5: int6 for attention weights, int5 for MLP weights (scope: attention weights and MLP weights).
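A minimal sketch of post-training round-to-nearest quantization at the bit widths listed above (6-bit for attention weights, 5-bit for MLP weights). The symmetric per-tensor scheme here is an assumption; the card does not state the exact quantizer.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1               # 31 for int6, 15 for int5
    scale = np.abs(w).max() / qmax           # per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

The int6/int5 values still occupy int8 storage here; packing them tighter (or compressing, as below) is what actually shrinks the artifact.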
Optimizer
- Muon, used alongside Adam (other_params: {"adam":true}); weight decay and momentum values not reported.
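Muon's core step orthogonalizes the momentum update of 2-D weight matrices via a Newton-Schulz iteration (Adam typically handles embeddings and scalar parameters, consistent with the Muon+Adam pairing above). A sketch of that iteration, using the quintic coefficients from the public Muon implementation; step count and epsilon are illustrative:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Push the singular values of G toward 1, approximating the nearest
    (semi-)orthogonal matrix; used by Muon to postprocess momentum updates."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + eps)        # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The iteration only needs matrix multiplies, so it runs cheaply on the same hardware as the forward pass.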
LR Schedule
- warmdown: trapezoidal schedule with 20 warmup steps and a 1200-iteration warmdown (parameters: {"warmup_steps":20,"warmdown_iterations":1200,"shape":"trapezoid"}).
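The trapezoidal schedule above can be sketched as a pure function of the step index: linear warmup for 20 steps, a flat top, then linear warmdown over the final 1200 iterations. The total step count and peak LR are placeholders, not values from the run.

```python
def trapezoid_lr(step, max_lr, total_steps, warmup_steps=20, warmdown_steps=1200):
    """Trapezoidal LR: linear warmup, constant plateau, linear warmdown."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps        # warmup ramp
    if step >= total_steps - warmdown_steps:
        return max_lr * (total_steps - step) / warmdown_steps  # warmdown ramp
    return max_lr                                        # flat top
```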
Compression
- zlib (level not specified)
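A sketch of the final packaging step: serialize the quantized int8-stored weights and zlib-compress them. The card does not state the zlib level; level 9 here is an assumption. Because int6/int5 values use fewer than 8 bits of entropy per byte, zlib recovers part of the slack.

```python
import zlib
import numpy as np

def pack_and_compress(q, level=9):
    """Serialize quantized int8 weights and zlib-compress the byte stream."""
    raw = q.astype(np.int8).tobytes()
    return zlib.compress(raw, level)

def decompress(blob, shape):
    """Lossless inverse: recover the exact quantized weights."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
```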
Sequence Length
- train_length: 1024
- eval_length: not specified
Novel Contributions
- Post-training quantization with int6 attention weights and int5 MLP weights
- Muon+Adam optimization with a trapezoidal learning-rate schedule
- Grouped query attention with tied embeddings in a 20-layer GPT-like model
- Int8 plus zlib compression to fit the final artifact within the size limit