val_bpb: 1.2381
Architecture: Transformer
Optimizer: —
Artifact Size: 15.85 MB
Training Techniques
LR Schedule: warmdown
parameters: {"warmdown_iters": 3000, "max_wallclock_seconds": 600}
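With these parameters, the warmdown schedule can be sketched as below. This is a minimal sketch: the constant-then-cosine shape and the planned total iteration count (`max_iters`) are assumptions for illustration, since the card only reports `warmdown_iters` and the wallclock cap.

```python
import math

def get_lr(it, max_iters, base_lr=1.0, warmdown_iters=3000):
    """Constant LR until the final warmdown_iters, then cosine decay to 0.
    max_iters and the exact decay shape are assumptions, not reported values."""
    start = max_iters - warmdown_iters
    if it < start:
        return base_lr
    progress = (it - start) / warmdown_iters  # goes 0 -> 1 over the warmdown
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

For example, with a hypothetical 10,000-iteration plan the LR stays at its base value through iteration 7,000, then decays smoothly to zero over the last 3,000 iterations.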
Architecture
KV head count: the GPT-style model uses fewer KV heads than attention heads (grouped-query attention; 4 KV heads serve 8 query heads).
parameters: {"num_heads": 8, "num_kv_heads": 4, "layers": 9, "model_dim": 512}
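The grouped-query attention implied by this config can be sketched as follows. This is a minimal NumPy illustration using the reported shapes (8 query heads, 4 KV heads, model_dim 512); the projection weights and the omitted causal mask are assumptions, not the model's actual code.

```python
import numpy as np

def gqa(x, wq, wk, wv, num_heads=8, num_kv_heads=4):
    """Sketch of grouped-query attention: consecutive query heads share a KV head.
    Causal masking is omitted for brevity."""
    B, T, D = x.shape
    hd = D // num_heads                      # head_dim = 512 // 8 = 64
    q = (x @ wq).reshape(B, T, num_heads, hd)
    k = (x @ wk).reshape(B, T, num_kv_heads, hd)
    v = (x @ wv).reshape(B, T, num_kv_heads, hd)
    group = num_heads // num_kv_heads        # 2 query heads per KV head
    k = np.repeat(k, group, axis=2)          # expand 4 KV heads to cover 8 query heads
    v = np.repeat(v, group, axis=2)
    q, k, v = (t.transpose(0, 2, 1, 3) for t in (q, k, v))  # (B, H, T, hd)
    att = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    att = att - att.max(axis=-1, keepdims=True)              # stable softmax
    att = np.exp(att)
    att /= att.sum(axis=-1, keepdims=True)
    out = att @ v                                            # (B, H, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, D)
```

With 4 KV heads instead of 8, the K/V projections and the KV cache are half the size of full multi-head attention, at the cost of each KV head being shared by two query heads.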
Sequence Length
train_length: null
eval_length: null
Novel Contributions
- Increased WARMDOWN_ITERS from the default 1200 to 3000 so the cosine warmdown actually triggers within the wallclock-limited run.
- With the wider window, learning-rate decay reliably occurs over the final portion of training, improving convergence.