PR #1380

open

Non-record: Focal Loss for LM Pretraining — 1.1567 int8 BPB on RTX 4000 Ada (3-line change)

by ranausmanai
val_bpb: 1.1567
Architecture: Transformer

Training Techniques

Regularization
  • label smoothing (parameters: null)
  • weight decay (parameters: null)
LR Schedule
  • cosine decay (parameters: {"min_lr_frac": 0.1})
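A cosine schedule with a floor of min_lr_frac times the peak rate can be sketched as follows. This is a minimal illustration of the {"min_lr_frac": 0.1} parameter above, not the PR's actual scheduler code; the function name and base_lr value are illustrative.

```python
import math

def cosine_lr(step, max_steps, base_lr, min_lr_frac=0.1):
    """Cosine decay from base_lr down to min_lr_frac * base_lr."""
    min_lr = min_lr_frac * base_lr
    # Cosine factor goes from 1.0 at step 0 to 0.0 at max_steps.
    cos = 0.5 * (1.0 + math.cos(math.pi * step / max_steps))
    return min_lr + (base_lr - min_lr) * cos

# Starts at the peak rate and decays to 10% of it.
start = cosine_lr(0, 1000, 1e-3)    # 1e-3
end = cosine_lr(1000, 1000, 1e-3)   # 1e-4
```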
Architecture
  • encoder-decoder split: asymmetric 1/10 encoder-decoder split with one encoder layer instead of the default half-split (parameters: {"num_encoder_layers": 1})
Other
  • focal loss applied to language model pretraining to down-weight easy tokens and focus on hard-to-predict tokens (parameters: {"gamma": 8})
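The per-token focal loss can be sketched as below, assuming the standard formulation: cross-entropy scaled by (1 - p_t)^gamma, with gamma = 8 as listed above. This is an illustrative single-token sketch, not the PR's implementation.

```python
import math

def focal_loss(probs, target_idx, gamma=8.0):
    """Focal loss for one token.

    probs: predicted distribution over the vocabulary (sums to 1).
    Plain cross-entropy is -log(p_t); focal loss scales it by
    (1 - p_t)^gamma, so confidently predicted "easy" tokens contribute
    almost nothing and hard-to-predict tokens dominate the loss.
    """
    p_t = probs[target_idx]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# With gamma=8, an easy token (p_t = 0.9) is suppressed by (0.1)^8,
# while a hard token (p_t = 0.1) keeps most of its cross-entropy.
easy = focal_loss([0.9, 0.05, 0.05], 0)
hard = focal_loss([0.1, 0.45, 0.45], 0)
```

Setting gamma = 0 recovers ordinary cross-entropy, which is why sweeping gamma upward (as the contributions list notes) isolates the effect of down-weighting easy tokens.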

Novel Contributions

  • Applied focal loss to language model pretraining
  • Combined focal loss with cosine learning-rate decay
  • Used an asymmetric 1/10 encoder-decoder split
  • Reported 1.1567 int8 BPB on a single RTX 4000 Ada with only a few code changes
  • Showed monotonic improvement as focal gamma increased