PR #271

open

Non-record: HyperparamTuned KV2 + FP16 Embed

val_bpb
1.3003
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
KV head count
Reduced key-value heads from 4 to 2 (GQA 8:2) to save parameters and improve throughput.
parameters: {"num_kv_heads":2}
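The 8:2 grouped-query layout means four query heads share each key-value head. A minimal NumPy sketch of that sharing (shapes and the helper name are illustrative, not this PR's actual code):

```python
import numpy as np

def gqa_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    """Grouped-query attention: each KV head serves a group of query heads.
    q: (num_q_heads, T, d); k, v: (num_kv_heads, T, d)."""
    group = num_q_heads // num_kv_heads      # queries per KV head (8:2 -> 4)
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                      # map query head to its shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

Halving the KV heads halves the K/V projection parameters and the KV cache, which is where the parameter and throughput savings come from.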
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"matrix_lr":0.048,"scalar_lr":0.03}
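The tuned optimizer settings above could be collected as follows; the key names mirror the recorded parameters, but how the training loop consumes them (e.g. routing matrix parameters to Muon and scalars to a fallback optimizer) is an assumption, not this PR's code:

```python
# Hypothetical config mirroring the recorded Muon hyperparameters.
muon_config = {
    "momentum": 0.97,        # raised momentum
    "weight_decay": None,    # no decoupled weight decay
    "matrix_lr": 0.048,      # LR for 2D (matrix) parameters updated by Muon
    "scalar_lr": 0.03,       # LR for scalar/vector parameters, typically
                             # handled by a separate optimizer
}
```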
LR Schedule
warmdown
parameters: {"warmdown_iters":600}
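A warmdown schedule holds the learning rate constant and then decays it over the final iterations. A sketch assuming a linear ramp to zero (the decay shape used by the PR is an assumption):

```python
def lr_multiplier(step, total_iters, warmdown_iters=600):
    """Constant LR multiplier of 1.0, then linear decay to 0 over the
    final `warmdown_iters` steps. Shortening warmdown_iters keeps the
    LR high for longer under a fixed iteration budget."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```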
Initialization
QK_GAIN_INIT
Lowered initial attention sharpness by initializing the QK gain to 1.35.
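One common way to realize a QK gain is as a learnable multiplier on the attention logits, so its initial value sets how peaked the softmax is at the start of training. A sketch under that assumption (the exact placement of the gain in this PR's model is not specified here):

```python
import numpy as np

def qk_logits(q, k, gain=1.35):
    """Attention logits scaled by a gain (shown at its init value).
    A smaller initial gain yields a softer softmax early in training.
    q, k: (T, d) for a single head."""
    d = q.shape[-1]
    return gain * (q @ k.T) / np.sqrt(d)
```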
Quantization
fp16
bits: 16
scope: embeddings
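Exporting the token embeddings in fp16 instead of int8 doubles their on-disk size but avoids per-tensor quantization scales and zero-points and keeps reconstruction error low. A hypothetical export helper (the function name and storage format are assumptions):

```python
import numpy as np

def export_embeddings(embed, path):
    """Store token embeddings as fp16: ~2x the bytes of int8, but no
    quantization parameters to track and much lower rounding error."""
    np.save(path, embed.astype(np.float16))
```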

Novel Contributions

  • Systematic hyperparameter tuning of the baseline 9x512 architecture
  • Reduced NUM_KV_HEADS from 4 to 2
  • Adjusted Muon and scalar learning rates
  • Shortened warmdown schedule
  • Increased Muon momentum
  • Lowered QK gain initialization
  • Exported token embeddings in fp16 instead of int8
  • Focused on throughput-neutral changes to improve BPB under wallclock constraints