PR #1752

open

Add healing-phase training submission (1.205 val_bpb)

by luccifer00
val_bpb
1.2050
Architecture
Transformer
Optimizer
AdamW
Artifact Size

Training Techniques

Architecture
factorized late layers
Uses late-layer factorization in the model.
parameters: null
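
The submission lists no parameters for the factorization, so the following is only a sketch of one common reading: the weight matrices of the last few layers are replaced by a low-rank product of two smaller matrices. The low-rank interpretation, the function names, and every number used here (layer count, dimensions, rank) are assumptions for illustration and are not taken from the PR.

```python
# Hypothetical sketch: parameter budget when only the late layers are factorized.
# Assumes "factorization" means replacing a dense (d_in x d_out) weight with
# a low-rank product (d_in x rank) @ (rank x d_out). Not confirmed by the PR.

def dense_params(d_in: int, d_out: int) -> int:
    """Parameter count of a full dense weight matrix (bias ignored)."""
    return d_in * d_out


def factorized_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of the two low-rank factors (bias ignored)."""
    return rank * (d_in + d_out)


def plan_layers(n_layers: int, n_late: int, d_in: int, d_out: int, rank: int) -> list[int]:
    """Per-layer parameter counts, with the last n_late layers factorized."""
    if not 0 <= n_late <= n_layers:
        raise ValueError("n_late must be between 0 and n_layers")
    counts = []
    for i in range(n_layers):
        is_late = i >= n_layers - n_late
        if is_late:
            counts.append(factorized_params(d_in, d_out, rank))
        else:
            counts.append(dense_params(d_in, d_out))
    return counts


if __name__ == "__main__":
    # Illustrative values only: 4 layers, last 2 factorized at rank 64.
    counts = plan_layers(n_layers=4, n_late=2, d_in=512, d_out=2048, rank=64)
    print(counts, sum(counts))
```

Under this reading, factorization saves parameters only when rank is smaller than d_in * d_out / (d_in + d_out), which is why it tends to be applied selectively rather than to every layer.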
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"stable optimizer settings":true}
LR Schedule
healing phase
parameters: {"reduced_lr":true,"reduced_weight_decay":true,"near_end_of_training":true}
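
The three flags above describe the healing phase only qualitatively. A minimal sketch of such a schedule follows, assuming a simple step change for the final fraction of training. The function name, the 10% phase length, the base values, and both scale factors are hypothetical, since the PR does not state them.

```python
# Hypothetical sketch of a "healing phase": near the end of training, both the
# learning rate and the weight decay are reduced. All defaults are assumptions.

def healing_schedule(
    step: int,
    total_steps: int,
    base_lr: float = 3e-4,
    base_wd: float = 0.1,
    heal_frac: float = 0.1,   # assumed: last 10% of steps form the healing phase
    lr_scale: float = 0.1,    # assumed: learning rate multiplier while healing
    wd_scale: float = 0.5,    # assumed: weight decay multiplier while healing
) -> tuple[float, float]:
    """Return (learning_rate, weight_decay) to use at the given step."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    heal_steps = round(total_steps * heal_frac)
    heal_start = total_steps - heal_steps
    if step < heal_start:
        return base_lr, base_wd
    return base_lr * lr_scale, base_wd * wd_scale


if __name__ == "__main__":
    for s in (0, 500, 950, 999):
        print(s, healing_schedule(s, total_steps=1000))
```

In a PyTorch training loop, the returned pair would typically be written into each entry of `optimizer.param_groups` (the `"lr"` and `"weight_decay"` keys) before calling `optimizer.step()`. A smooth ramp instead of a step change would be an equally plausible reading of the description.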

Novel Contributions

  • Late-layer factorization
  • Final healing phase with reduced learning rate and weight decay
  • Stable optimizer settings
  • Self-contained training file following the train_gpt_* pattern