val_bpb: 1.3003
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
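val_bpb is validation bits per byte: the model's summed cross-entropy (in nats) converted to base-2 and normalized by the byte length of the raw text. A minimal sketch of that conversion; the byte and loss totals below are placeholders, not numbers from this run.

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Convert summed cross-entropy (nats) over a validation set
    into bits per byte of the underlying raw text."""
    return total_loss_nats / math.log(2) / total_bytes

# Placeholder totals: summed loss in nats over a known number of raw bytes
print(bits_per_byte(9.013e5, 1e6))
```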
Training Techniques
Architecture: KV head count
  Reduced key-value heads from 4 to 2 (GQA 8:2) to save parameters and improve throughput.
  parameters: {"num_kv_heads": 2}
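With grouped-query attention at 8:2, each of the 2 key-value heads is shared by a group of 4 query heads. A minimal NumPy sketch of that sharing, assuming 8 query heads (from the 8:2 ratio); sequence length and head dimension are placeholders.

```python
import numpy as np

def gqa_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    """Grouped-query attention: each KV head is repeated so that a group of
    query heads (here 8 // 2 = 4) attends against the same keys/values.
    q: (num_q_heads, T, d); k, v: (num_kv_heads, T, d)."""
    group = num_q_heads // num_kv_heads
    k = np.repeat(k, group, axis=0)          # (num_q_heads, T, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)     # (H, T, T)
    scores -= scores.max(axis=-1, keepdims=True)        # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                        # (H, T, d)

T, d = 4, 16                                 # placeholder sizes
rng = np.random.default_rng(0)
q = rng.normal(size=(8, T, d))
k = rng.normal(size=(2, T, d))
v = rng.normal(size=(2, T, d))
out = gqa_attention(q, k, v)
print(out.shape)
```

Only the 2 KV heads' projections are stored and cached, which is where the parameter and throughput savings come from.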
Optimizer: Muon
  weight_decay: null
  momentum: 0.97
  other_params: {"matrix_lr": 0.048, "scalar_lr": 0.03}
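Muon applies SGD-style momentum to matrix parameters and then orthogonalizes the update with a Newton-Schulz iteration before stepping. A rough sketch of one such step using this run's matrix_lr=0.048 and momentum=0.97; the quintic coefficients follow the public Muon reference implementation, and this is not the exact training code (e.g. it omits the transpose handling for wide matrices).

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from Muon's reference code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, buf, lr=0.048, momentum=0.97):
    """One Muon-style update: momentum accumulation, orthogonalized step."""
    buf = momentum * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
g = rng.normal(size=(64, 64))
W2, buf = muon_step(W, g, np.zeros_like(W))
print(W2.shape)
```

Scalar (non-matrix) parameters are handled separately, which is what the distinct scalar_lr=0.03 refers to.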
LR Schedule: warmdown
  parameters: {"warmdown_iters": 600}
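A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final iterations. A minimal sketch with this run's warmdown_iters=600; the total iteration count is a placeholder, and linear decay is an assumption (warmdown is typically linear, but the exact shape is not stated here).

```python
def lr_scale(step, total_iters, warmdown_iters=600):
    """Multiplier on the base LR: 1.0 for most of training, then a linear
    'warmdown' to zero over the final warmdown_iters steps."""
    flat_until = total_iters - warmdown_iters
    if step < flat_until:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)

# total_iters=5000 is a placeholder; only warmdown_iters comes from the config
print(lr_scale(0, 5000), lr_scale(4700, 5000), lr_scale(5000, 5000))
```

Shortening the warmdown (one of the contributions below) trades a longer flat phase for a steeper final decay.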
Initialization: QK_GAIN_INIT
  Lowered the QK gain initialization to 1.35 to reduce initial attention sharpness.
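One way to read this: a learnable per-head gain multiplies the query-key logits, and initializing it lower makes early attention distributions flatter (less peaked softmax). The sketch below is an assumption about where the gain is applied; only the initial value 1.35 comes from the config.

```python
import numpy as np

def init_qk_gain(num_heads, qk_gain_init=1.35):
    """Per-head scalar gain on the q.k logits, initialized to 1.35.
    A lower starting gain means softer attention early in training."""
    return np.full(num_heads, qk_gain_init, dtype=np.float32)

def attn_logits(q, k, gain):
    # q, k: (heads, T, d). Smaller gain -> flatter pre-softmax logits.
    # Applying the gain directly to the scaled dot product is an assumption.
    d = q.shape[-1]
    return gain[:, None, None] * (q @ k.transpose(0, 2, 1)) / np.sqrt(d)

gain = init_qk_gain(8)
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))
k = rng.normal(size=(8, 4, 16))
print(attn_logits(q, k, gain).shape)
```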
Quantization: fp16
  bits: 16
  scope: embeddings
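Exporting the token embeddings in fp16 rather than int8 (noted in the contributions below) doubles their storage but avoids the rounding error of 8-bit quantization. A small comparison sketch; the embedding shape and the per-tensor int8 scheme used for comparison are placeholders, not this run's actual export code.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64)).astype(np.float32)  # vocab x dim (placeholder sizes)

# fp16 export: 2 bytes/param, no quantization scale to store
emb_fp16 = emb.astype(np.float16)

# int8 export for comparison: 1 byte/param plus a per-tensor scale, coarser rounding
scale = np.abs(emb).max() / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)
recon_int8 = emb_int8.astype(np.float32) * scale

err_fp16 = np.abs(emb - emb_fp16.astype(np.float32)).max()
err_int8 = np.abs(emb - recon_int8).max()
print(err_fp16, err_int8)
```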
Novel Contributions
- Systematic hyperparameter tuning of the baseline 9x512 architecture
- Reduced NUM_KV_HEADS from 4 to 2
- Adjusted Muon and scalar learning rates
- Shortened warmdown schedule
- Increased Muon momentum
- Lowered QK gain initialization
- Exported token embeddings in fp16 instead of int8
- Focused on throughput-neutral changes to improve BPB under wallclock constraints