PR #46

closed

Optimized SOTA Submission: 1.2697 bpb

val_bpb: 1.2697
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.0 MB

Training Techniques

Quantization: int8
  bits: 8
  scope: all
Architecture: KV head count
  A 9-layer, 432-dim Transformer with grouped-query attention (GQA): 8 query heads share 2 KV heads, improving parameter efficiency.
  parameters: {"layers":9,"dim":432,"heads":8,"kv_heads":2,"mlp_mult":2}
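As a sketch of how the 8 query heads can share 2 KV heads, a minimal numpy grouped-query attention (shapes and names here are illustrative, not the submission's actual code):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.

    q: (n_heads, T, d)    k, v: (n_kv_heads, T, d)
    """
    n_heads, T, d = q.shape
    n_kv = k.shape[0]
    group = n_heads // n_kv                        # 8 // 2 = 4 query heads per KV head
    k = np.repeat(k, group, axis=0)                # broadcast KV heads across query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                   # (n_heads, T, d)

rng = np.random.default_rng(0)
T, d = 16, 54                                      # 432 dim / 8 heads = 54 per head
q = rng.standard_normal((8, T, d))
k = rng.standard_normal((2, T, d))
v = rng.standard_normal((2, T, d))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 16, 54)
```

With 2 KV heads instead of 8, the K and V projection matrices shrink by 4x, which is where the parameter savings come from.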
Optimizer: Muon
  weight_decay: null
  momentum: 0.9
  other_params: {"beta1":0.85,"beta2":0.98,"grad_clip":1}
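Muon's core step orthogonalizes the momentum-smoothed gradient with a Newton-Schulz iteration before applying it. A minimal sketch, assuming the quintic coefficients from the public Muon reference (the learning rate below is a placeholder, not the submission's value):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G: push its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315       # quintic coefficients from the Muon reference
    X = G / (np.linalg.norm(G) + 1e-7)      # Frobenius-normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                   # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X

def muon_step(W, grad, mom, lr=0.02, beta=0.9):
    """One hypothetical Muon update: momentum, orthogonalize, step."""
    mom = beta * mom + grad
    W = W - lr * newton_schulz_orth(mom)
    return W, mom

G = np.diag([0.9, 0.5, 0.3])                # known singular values, for illustration
X = newton_schulz_orth(G)
print(np.linalg.svd(X, compute_uv=False))   # all pushed toward 1
```

The iteration is matmul-only, so it runs efficiently on accelerators; five steps are typically enough to land singular values near 1.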
Compression: zlib
  level: null
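The artifact pipeline (int8 quantization followed by zlib) can be sketched as below; symmetric per-tensor scaling and compression level 9 are assumptions, the submission may pack weights differently:

```python
import zlib
import numpy as np

def pack(w):
    """Symmetric per-tensor int8 quantization, then zlib compression."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9), scale, w.shape

def unpack(blob, scale, shape):
    """Decompress and dequantize back to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((432, 432)).astype(np.float32)
blob, scale, shape = pack(w)
w_hat = unpack(blob, scale, shape)
print(len(blob), w.nbytes)  # compressed size vs. 746496 raw float32 bytes
```

Rounding to int8 bounds the per-weight reconstruction error by scale/2, and the int8 payload alone is already 4x smaller than float32 before zlib touches it.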
Sequence Length
  train_length: 1024
  eval_length: null
LR Schedule: linear warmup
  parameters: {"warmup_steps":100}
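Linear warmup over the first 100 steps amounts to the schedule below; the base learning rate is a placeholder, since the submission does not list it, and no decay phase is listed either:

```python
def warmup_lr(step, base_lr, warmup_steps=100):
    """Linear warmup: ramp from 0 to base_lr, then hold (no decay listed)."""
    return base_lr * min(1.0, step / warmup_steps)

print(warmup_lr(50, 0.01))   # 0.005
print(warmup_lr(100, 0.01))  # 0.01
print(warmup_lr(500, 0.01))  # 0.01
```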
Regularization: gradient clipping
  parameters: {"norm":1}
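Clipping to norm 1 rescales the whole gradient whenever its global L2 norm exceeds the threshold; a minimal numpy sketch of global-norm clipping (the submission may clip per-parameter instead):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their global L2 norm <= max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 0.0]), np.array([4.0])]  # global norm = 5
clipped, norm = clip_grad_norm(grads)
print(norm)  # 5.0
```

Because all arrays share one scale factor, the gradient's direction is preserved; only its magnitude is capped.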
Other
  Large-batch training with systematic hyperparameter tuning and full utilization of the 10-minute wallclock budget.
  parameters: {"train_batch_tokens":786432,"max_wallclock_seconds":600,"experiments":8}

Novel Contributions

  • Systematic optimization campaign from 1.42 to 1.27 bpb
  • 9x432 Transformer with efficient GQA and 2 KV heads
  • Large-batch training with conservative learning rates
  • Full utilization of the 10-minute training budget
  • int8 + zlib compressed submission artifact