PR #146

closed

Non-record: Warmdown-Tuned Training (val_bpb=1.2987) on 1xRTX 5090

by swapp1990
val_bpb
1.2987
Architecture
GPT
Optimizer
Muon
Artifact Size
15.8MB

Training Techniques

Architecture
tied embeddings
Uses tied input/output embeddings in the GPT model.
parameters: null
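Tied embeddings mean the output logit projection reuses the input token-embedding matrix instead of a separate lm_head. A minimal sketch of the idea (shapes here are illustrative placeholders, not the submission's actual dimensions):

```python
import numpy as np

# Tied input/output embeddings: one matrix serves both the token lookup
# and the logit projection, halving the embedding parameter count.
vocab, dim = 32, 8
rng = np.random.default_rng(0)
embed = rng.standard_normal((vocab, dim))

x = embed[5]          # input: embedding lookup for token id 5
logits = embed @ x    # output: logit projection reuses the same matrix
assert logits.shape == (vocab,)
```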
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
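With num_heads=8 and num_kv_heads=4, each K/V head is shared by two query heads (grouped-query attention). A hedged sketch of the head sharing, assuming plain repetition of KV heads (sequence length and head dim below are placeholders):

```python
import numpy as np

def group_kv_heads(kv, num_heads):
    """Expand (B, num_kv_heads, T, d) to (B, num_heads, T, d) by
    repeating each KV head for its group of query heads."""
    b, num_kv, t, d = kv.shape
    group = num_heads // num_kv  # 8 // 4 = 2 query heads per KV head
    return np.repeat(kv, group, axis=1)

k = np.random.default_rng(0).standard_normal((1, 4, 16, 64))
k_expanded = group_kv_heads(k, num_heads=8)
assert k_expanded.shape == (1, 8, 16, 64)
```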
layer looping
Explores looping a smaller set of unique layers to form a wider/deeper effective model.
parameters: {"unique_layers":6,"model_dim":608,"looped_layers":9}
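Layer looping reuses a small pool of unique layers to reach a larger effective depth. The PR does not state the loop order, so the round-robin schedule below is a hypothetical illustration of 6 unique layers unrolled to 9 effective layers at model_dim=608:

```python
import numpy as np

rng = np.random.default_rng(0)
model_dim = 608
# 6 unique residual layers, reused to form 9 effective layers.
unique_layers = [rng.standard_normal((model_dim, model_dim)) * 0.02
                 for _ in range(6)]
schedule = [i % 6 for i in range(9)]  # hypothetical round-robin reuse

x = rng.standard_normal(model_dim)
for idx in schedule:
    x = x + unique_layers[idx] @ x  # residual application of a shared layer

assert len(schedule) == 9 and len(set(schedule)) == 6
```

Parameters are stored only for the 6 unique layers, so compute depth grows without growing the artifact.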
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
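The int8 quantization and zlib compression combine into the artifact roundtrip mentioned under Novel Contributions. The exact scaling scheme is not stated in the PR, so symmetric per-tensor absmax quantization below is an assumption:

```python
import zlib
import numpy as np

def pack(w: np.ndarray):
    """Quantize float32 weights to int8 (absmax scale), then zlib-compress."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale, w.shape

def unpack(blob, scale, shape):
    """Decompress and dequantize back to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((608, 608)).astype(np.float32)
blob, scale, shape = pack(w)
w2 = unpack(blob, scale, shape)
assert np.max(np.abs(w - w2)) <= scale  # error bounded by one quant step
```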
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
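A warmdown schedule holds the learning rate constant and then decays it over the final steps. A minimal sketch assuming a linear decay over warmdown_steps=3000; the base LR (0.04, matching matrix_lr below) and total step count are placeholders:

```python
def warmdown_lr(step, total_steps, warmdown_steps=3000, base_lr=0.04):
    """Constant LR, then linear decay to zero over the last warmdown_steps."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac

total = 10_000
assert warmdown_lr(0, total) == 0.04        # flat phase
assert warmdown_lr(total - 1500, total) == 0.02  # halfway through warmdown
assert warmdown_lr(total, total) == 0.0     # fully decayed
```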
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"muon_momentum_warmup_steps":50,"matrix_lr":0.04,"scalar_lr":0.04,"tied_embed_lr":0.05}
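The muon_momentum_warmup_steps=50 parameter ramps Muon's momentum up at the start of training. Only the 0.95 target and the 50-step window come from the PR; the starting value (0.85 here) is an assumption for illustration:

```python
def muon_momentum(step, warmup_steps=50, start=0.85, target=0.95):
    """Linearly warm momentum from `start` to `target` over warmup_steps."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (target - start)

assert muon_momentum(0) == 0.85
assert abs(muon_momentum(50) - 0.95) < 1e-9
assert muon_momentum(1000) == 0.95  # clamped after warmup
```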
Other
other
Register tokens were tested in three variants and found not to improve overall BPB.
parameters: {"variants_tested":3}

Novel Contributions

  • Identified warmdown_iters=3000 as the best learning-rate warmdown setting.
  • Showed that warmdown provides disproportionate BPB improvement per training step.
  • Tested register token approaches and ruled them out as ineffective at this scale.
  • Observed that longer warmdown revived a dead middle layer (layer 3).
  • Explored layer looping and a wider model as a follow-up direction.
  • Fit the submission under the 16MB artifact limit with int8+zlib roundtrip.