val_bpb: 1.2697
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.0 MB

Training Techniques

Quantization: int8
bits: 8
scope: all
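The record lists int8 quantization over all weights (scope: all). A minimal sketch of symmetric per-tensor int8 quantization, assuming one fp32 scale per tensor (the record does not state the granularity):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8: store int8 values plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((432, 432)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# rounding error is bounded by half a quantization step
assert np.abs(w - w_hat).max() <= s / 2 + 1e-6
```

Storing int8 values plus a single scale cuts each tensor to roughly a quarter of its fp32 size, before any entropy coding is applied.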
Architecture: KV head count
Used a 9-layer, 432-dim Transformer with grouped-query attention (GQA): 8 query heads share 2 KV heads, cutting KV projection parameters and cache size to a quarter of full multi-head attention.
parameters: {"layers": 9, "dim": 432, "heads": 8, "kv_heads": 2, "mlp_mult": 2}
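A minimal NumPy sketch of the grouped-query attention layout these parameters imply (8 query heads, 2 KV heads, head dim 432/8 = 54); the projection shapes are illustrative, not the submission's actual code:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2):
    """Causal grouped-query attention: each KV head serves a group of query heads."""
    T, d = x.shape
    hd = d // n_heads                 # head dim: 432 // 8 = 54
    group = n_heads // n_kv_heads     # 4 query heads per KV head
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    k = np.repeat(k, group, axis=1)   # share each KV head across its group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[:, mask] = -1e9            # causal masking
    p = np.exp(scores - scores.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    return np.einsum("hts,shd->thd", p, v).reshape(T, d)

d, T = 432, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d)).astype(np.float32)
wq = rng.standard_normal((d, d)).astype(np.float32) * 0.02
wk = rng.standard_normal((d, d // 4)).astype(np.float32) * 0.02  # 2 KV heads x 54 dims
wv = rng.standard_normal((d, d // 4)).astype(np.float32) * 0.02
out = gqa_attention(x, wq, wk, wv)
assert out.shape == (T, d)
```

Note the K/V projections map 432 dims down to 108 (2 heads x 54), which is where the parameter and KV-cache savings come from.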
Optimizer: Muon
weight_decay: null
momentum: 0.9
other_params: {"beta1": 0.85, "beta2": 0.98, "grad_clip": 1}
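Muon accumulates momentum, then orthogonalizes each 2D weight update with a Newton-Schulz iteration before stepping. A simplified sketch, assuming the quintic coefficients from the public Muon reference implementation (the beta1/beta2 entries suggest an Adam-style optimizer handles non-matrix parameters, not shown here; the learning rate below is illustrative):

```python
import numpy as np

def newton_schulz(g, steps=5):
    """Drive the singular values of g toward 1 (approximate orthogonalization).
    Coefficients follow the public Muon reference code; this is a sketch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)    # Frobenius-normalize so sigma <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transpose else x

def muon_step(w, grad, buf, lr=0.02, momentum=0.9):
    """One Muon update for a 2D weight: momentum buffer, then orthogonalized step."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(0)
w = rng.standard_normal((432, 432)) * 0.02
buf = np.zeros_like(w)
w, buf = muon_step(w, rng.standard_normal((432, 432)), buf)
assert w.shape == (432, 432)
```

Orthogonalizing the update equalizes its singular values, which is why Muon can tolerate the larger batches and higher effective step sizes this record uses.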
Compression: zlib
level: null
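A sketch of how an int8 artifact might be packed with zlib (the record leaves the compression level null, so level=9 below is an assumption, and the tensor name is a toy stand-in):

```python
import io
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = {"wte": rng.standard_normal((256, 432))}   # toy stand-in for the model
quantized = {}
for name, w in weights.items():
    scale = np.abs(w).max() / 127.0
    quantized[name] = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

buf = io.BytesIO()
np.savez(buf, **quantized)            # serialize int8 tensors (uncompressed zip)
raw = buf.getvalue()
packed = zlib.compress(raw, level=9)  # level assumed; the record does not specify
assert zlib.decompress(packed) == raw # lossless round trip
```

Near-Gaussian int8 weights carry under 8 bits of entropy per byte, so a final zlib pass shaves additional size on top of quantization itself.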
Sequence Length
train_length: 1024
eval_length: null
LR Schedule: linear warmup
parameters: {"warmup_steps": 100}
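The schedule as recorded is a linear ramp over the first 100 steps; what follows the warmup is not stated, so a constant rate is assumed below, and base_lr is illustrative:

```python
def lr_at(step, base_lr, warmup_steps=100):
    """Linear warmup from ~0 to base_lr over warmup_steps, then hold.
    Post-warmup decay, if any, is not specified in the record."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

assert lr_at(0, 0.02) == 0.02 * 1 / 100   # first step: 1% of base
assert lr_at(99, 0.02) == 0.02            # warmup complete
assert lr_at(500, 0.02) == 0.02           # held constant thereafter
```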
Regularization: gradient clipping
parameters: {"norm": 1}
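Global-norm clipping at 1.0, per the parameters; a minimal sketch over a list of gradient arrays:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their global L2 norm exceeds max_norm."""
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total

grads, norm = clip_grad_norm([np.ones(100)])  # global norm = sqrt(100) = 10
assert norm == 10.0
assert abs(np.linalg.norm(grads[0]) - 1.0) < 1e-12
```

Clipping the global norm (rather than per-tensor) preserves the relative direction of the full gradient while bounding step magnitude.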
Other
Large-batch training (786,432 tokens per optimizer step) with systematic hyperparameter tuning across 8 experiments, using the full 10-minute wallclock budget.
parameters: {"train_batch_tokens": 786432, "max_wallclock_seconds": 600, "experiments": 8}
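With train_length = 1024, the token batch implies 768 sequences per optimizer step (assuming the batch divides evenly into full-length sequences):

```python
train_batch_tokens = 786_432
seq_len = 1024                      # train_length from the record
seqs_per_step = train_batch_tokens // seq_len
assert train_batch_tokens % seq_len == 0
assert seqs_per_step == 768
```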
Novel Contributions
- Systematic optimization campaign reducing val_bpb from 1.42 to 1.27
- 9-layer, 432-dim Transformer with GQA (8 query heads, 2 KV heads)
- Large-batch training with conservative learning rates
- Full utilization of the 10-minute training budget
- int8 + zlib compressed submission artifact