PR #146

closed

Non-record: Warmdown-Tuned Training (val_bpb=1.2987) on 1xRTX 5090

by swapp1990
val_bpb
1.2987
Architecture
GPT
Optimizer
Muon
Artifact Size
15.8MB

Training Techniques

Architecture
tied embeddings
Uses tied input/output embeddings in the GPT model.
parameters: null
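Tied embeddings mean the output logit projection reuses the input token-embedding matrix instead of a separate lm_head. A minimal sketch of the idea (shapes here are illustrative placeholders, not the submission's actual dimensions):

```python
import numpy as np

# Tied input/output embeddings: one matrix serves both the token lookup
# and the logit projection, halving the embedding parameter count.
vocab, dim = 32, 8
rng = np.random.default_rng(0)
embed = rng.standard_normal((vocab, dim))

x = embed[5]          # input: embedding lookup for token id 5
logits = embed @ x    # output: logit projection reuses the same matrix
assert logits.shape == (vocab,)
```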
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
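With num_heads=8 and num_kv_heads=4, each K/V head is shared by two query heads (grouped-query attention). A hedged sketch of the head sharing, assuming plain repetition of KV heads (sequence length and head dim below are placeholders):

```python
import numpy as np

def group_kv_heads(kv, num_heads):
    """Expand (B, num_kv_heads, T, d) to (B, num_heads, T, d) by
    repeating each KV head for its group of query heads."""
    b, num_kv, t, d = kv.shape
    group = num_heads // num_kv  # 8 // 4 = 2 query heads per KV head
    return np.repeat(kv, group, axis=1)

k = np.random.default_rng(0).standard_normal((1, 4, 16, 64))
k_expanded = group_kv_heads(k, num_heads=8)
assert k_expanded.shape == (1, 8, 16, 64)
```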
layer looping
Explores looping a smaller set of unique layers to form a wider/deeper effective model.
parameters: {"unique_layers":6,"model_dim":608,"looped_layers":9}
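Layer looping reuses a small pool of unique layers to reach a larger effective depth. The PR does not state the loop order, so the round-robin schedule below is a hypothetical illustration of 6 unique layers unrolled to 9 effective layers at model_dim=608:

```python
import numpy as np

rng = np.random.default_rng(0)
model_dim = 608
# 6 unique residual layers, reused to form 9 effective layers.
unique_layers = [rng.standard_normal((model_dim, model_dim)) * 0.02
                 for _ in range(6)]
schedule = [i % 6 for i in range(9)]  # hypothetical round-robin reuse

x = rng.standard_normal(model_dim)
for idx in schedule:
    x = x + unique_layers[idx] @ x  # residual application of a shared layer

assert len(schedule) == 9 and len(set(schedule)) == 6
```

Parameters are stored only for the 6 unique layers, so compute depth grows without growing the artifact.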
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
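The int8 quantization and zlib compression combine into the artifact roundtrip mentioned under Novel Contributions. The exact scaling scheme is not stated in the PR, so symmetric per-tensor absmax quantization below is an assumption:

```python
import zlib
import numpy as np

def pack(w: np.ndarray):
    """Quantize float32 weights to int8 (absmax scale), then zlib-compress."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, w.shape

def unpack(blob, scale, shape):
    """Decompress and dequantize back to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((608, 608)).astype(np.float32)
blob, scale, shape = pack(w)
w2 = unpack(blob, scale, shape)
assert np.max(np.abs(w - w2)) <= scale  # error bounded by one quant step
```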
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
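A warmdown schedule holds the learning rate constant and then decays it over the final steps. A minimal sketch assuming a linear decay over warmdown_steps=3000; the base LR (0.04, matching matrix_lr below) and total step count are placeholders:

```python
def warmdown_lr(step, total_steps, warmdown_steps=3000, base_lr=0.04):
    """Constant LR, then linear decay to zero over the last warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac

total = 10_000
assert warmdown_lr(0, total) == 0.04        # flat phase
assert warmdown_lr(total - 1500, total) == 0.02  # halfway through warmdown
assert warmdown_lr(total, total) == 0.0     # fully decayed
```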
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"muon_momentum_warmup_steps":50,"matrix_lr":0.04,"scalar_lr":0.04,"tied_embed_lr":0.05}
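The muon_momentum_warmup_steps=50 parameter ramps Muon's momentum up at the start of training. Only the 0.95 target and the 50-step window come from the PR; the starting value (0.85 here) is an assumption for illustration:

```python
def muon_momentum(step, warmup_steps=50, start=0.85, target=0.95):
    """Linearly warm momentum from `start` to `target` over warmup_steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (target - start)

assert muon_momentum(0) == 0.85
assert abs(muon_momentum(50) - 0.95) < 1e-9
assert muon_momentum(1000) == 0.95  # clamped after warmup
```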
Other
other
Register tokens were tested in three variants and found not to improve overall BPB.
parameters: {"variants_tested":3}

Novel Contributions

  • Identified warmdown_iters=3000 as the best learning-rate warmdown setting.
  • Showed that warmdown provides disproportionate BPB improvement per training step.
  • Tested register token approaches and ruled them out as ineffective at this scale.
  • Observed that longer warmdown revived a dead middle layer (layer 3).
  • Explored layer looping and a wider model as a follow-up direction.
  • Fit the submission under the 16MB artifact limit with int8+zlib roundtrip.