PR #1380 (open)
Non-record: Focal Loss for LM Pretraining — 1.1567 int8 BPB on RTX 4000 Ada (3-line change)
by ranausmanai
val_bpb: 1.1567
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Regularization
label smoothing
parameters: null
weight decay
parameters: null
LR Schedule
cosine decay
parameters: {"min_lr_frac":0.1}
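The recorded schedule can be sketched as follows; `min_lr_frac` is the parameter logged above, while the function name, `base_lr`, and step counting are illustrative assumptions:

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float,
              min_lr_frac: float = 0.1) -> float:
    """Cosine decay from base_lr down to min_lr_frac * base_lr (hypothetical helper)."""
    # cos_frac goes 1 -> 0 as step goes 0 -> total_steps
    cos_frac = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    # interpolate between the floor (min_lr_frac) and full LR
    return base_lr * (min_lr_frac + (1.0 - min_lr_frac) * cos_frac)
```

With `min_lr_frac=0.1` the learning rate decays to 10% of the base rate at the end of training instead of annealing all the way to zero.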
Architecture
encoder-decoder split
Asymmetric 1/10 encoder-decoder split with one encoder layer instead of the default half-split.
parameters: {"num_encoder_layers":1}
Other
other
Focal loss applied to language model pretraining to down-weight easy tokens and focus on hard-to-predict tokens.
parameters: {"gamma":8}
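A minimal per-token sketch of this idea, assuming standard softmax cross-entropy as the base loss; `gamma=8` is the recorded setting, and the function name and plain-Python softmax are illustrative (the PR presumably operates on batched logits in PyTorch):

```python
import math

def focal_loss(logits: list[float], target: int, gamma: float = 8.0) -> float:
    """Focal loss for one token: (1 - p_t)**gamma * cross-entropy.

    Easy tokens (p_t near 1) get a near-zero weight; hard-to-predict
    tokens keep most of their cross-entropy loss. gamma=0 recovers
    plain cross-entropy. (Illustrative sketch, not the PR's code.)
    """
    m = max(logits)                              # stabilize the softmax
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)               # probability of the true token
    return (1.0 - p_t) ** gamma * -math.log(p_t)
```

Swapping this modulating factor into the per-token cross-entropy term is the kind of few-line change the title describes.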
Novel Contributions
- Applied focal loss to language model pretraining
- Combined focal loss with cosine learning-rate decay
- Used an asymmetric 1/10 encoder-decoder split
- Reported 1.1567 int8 BPB on a single RTX 4000 Ada with only a few code changes
- Showed monotonic improvement as focal gamma increased