PR #1380 (open)
Non-record: Focal Loss for LM Pretraining — 1.1567 int8 BPB on RTX 4000 Ada (3-line change)
by ranausmanai
val_bpb: 1.1567
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Regularization
label smoothing
parameters: null
weight decay
parameters: null
LR Schedule
cosine decay
parameters: {"min_lr_frac":0.1}
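The recorded schedule can be sketched as follows; `min_lr_frac` is the parameter logged above, while the function name, `base_lr`, and step counting are illustrative assumptions:

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float,
              min_lr_frac: float = 0.1) -> float:
    """Cosine decay from base_lr down to min_lr_frac * base_lr (hypothetical helper)."""
    # cos_frac goes 1 -> 0 as step goes 0 -> total_steps
    cos_frac = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    # interpolate between the floor (min_lr_frac) and full LR
    return base_lr * (min_lr_frac + (1.0 - min_lr_frac) * cos_frac)
```

With `min_lr_frac=0.1` the learning rate decays to 10% of the base rate at the end of training instead of annealing all the way to zero.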
Architecture
encoder-decoder split
Asymmetric 1/10 encoder-decoder split with one encoder layer instead of the default half-split.
parameters: {"num_encoder_layers":1}
Other
other
Focal loss applied to language model pretraining to down-weight easy tokens and focus on hard-to-predict tokens.
parameters: {"gamma":8}
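A minimal per-token sketch of this idea, assuming standard softmax cross-entropy as the base loss; `gamma=8` is the recorded setting, and the function name and plain-Python softmax are illustrative (the PR presumably operates on batched logits in PyTorch):

```python
import math

def focal_loss(logits: list[float], target: int, gamma: float = 8.0) -> float:
    """Focal loss for one token: (1 - p_t)**gamma * cross-entropy.

    Easy tokens (p_t near 1) get a near-zero weight; hard-to-predict
    tokens keep most of their cross-entropy loss. gamma=0 recovers
    plain cross-entropy. (Illustrative sketch, not the PR's code.)
    """
    m = max(logits)                              # stabilize the softmax
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)               # probability of the true token
    return (1.0 - p_t) ** gamma * -math.log(p_t)
```

Swapping this modulating factor into the per-token cross-entropy term is the kind of few-line change the title describes.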
Novel Contributions
- Applied focal loss to language model pretraining
- Combined focal loss with cosine learning-rate decay
- Used an asymmetric 1/10 encoder-decoder split
- Reported 1.1567 int8 BPB on a single RTX 4000 Ada with only a few code changes
- Showed monotonic improvement as focal gamma increased