| Field | Value |
| --- | --- |
| val_bpb | 1.2050 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | — |
## Training Techniques

### Architecture
- Technique: factorized late layers (late-layer factorization applied in the model)
- parameters: null
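The source does not say which weight matrices are factorized or at what rank, so the following is only a minimal sketch of the general idea: replacing a dense late-layer projection with a low-rank pair of linear maps. The class name `FactorizedLinear` and the choice of `d = 768`, `rank = 64` are hypothetical.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Low-rank stand-in for a dense projection: y = up(down(x)).

    A d_in x d_out dense weight has d_in * d_out parameters; the
    factorized pair has d_in * rank + rank * d_out, a large saving
    when rank << min(d_in, d_out).
    """
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in -> rank
        self.up = nn.Linear(rank, d_out, bias=False)   # rank -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# Parameter count vs. a full d x d projection (illustrative sizes):
d, rank = 768, 64
full_params = d * d
fact_params = d * rank + rank * d
```

In a Transformer, a factorization like this would typically be substituted only into the last few blocks, leaving earlier layers dense.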
### Optimizer
- Type: AdamW
- weight_decay: null
- momentum: null
- other_params: stable optimizer settings
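The record says only that AdamW was used with "stable optimizer settings"; the learning rate, betas, eps, and the decay/no-decay parameter split in this sketch are all illustrative assumptions, not values from the source.

```python
import torch
import torch.nn as nn

def configure_adamw(model: nn.Module, lr: float = 3e-4,
                    weight_decay: float = 0.1) -> torch.optim.AdamW:
    """AdamW with conservative, commonly used hyperparameters.

    All numeric values here are hypothetical; the source leaves
    weight_decay and momentum unspecified.
    """
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # Common convention: decay 2D+ weights, not biases/norm gains.
        (decay if p.dim() >= 2 else no_decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95), eps=1e-8,
    )
```

Splitting parameters into decay and no-decay groups is a standard stabilizing choice in GPT-style training files, which is why it appears in this sketch.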
### LR Schedule
- Phase: healing phase
- Parameters: reduced learning rate, reduced weight decay, applied near the end of training
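The healing phase above can be sketched as a simple schedule function: both the learning rate and the weight decay drop once training enters its final stretch. The fraction of training spent healing (`healing_frac`) and the reduction factor (`scale`) are hypothetical; the source states only that both quantities are reduced near the end of training.

```python
def healed_hparams(step: int, total_steps: int,
                   base_lr: float = 3e-4, base_wd: float = 0.1,
                   healing_frac: float = 0.1, scale: float = 0.1):
    """Return (lr, weight_decay) for a given step.

    Both values stay at their base settings for most of training,
    then are scaled down together during a final "healing" phase.
    All numeric defaults are illustrative assumptions.
    """
    healing_start = int(total_steps * (1.0 - healing_frac))
    if step >= healing_start:
        return base_lr * scale, base_wd * scale
    return base_lr, base_wd
```

In a training loop, the returned values would be written into the optimizer's param groups each step before calling `step()`.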
## Novel Contributions
- Late-layer factorization
- Final healing phase with reduced learning rate and weight decay
- Stable optimizer settings
- Self-contained training file following the train_gpt_* pattern