PR #1233

open

Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460

by ibarrajo
val_bpb: 1.1460
Architecture: Transformer
Optimizer:
Artifact Size:

Training Techniques

  • Quantization: GPTQ (bits: 5; scope: model weights)
  • Weight Averaging: SWA (no parameters)
  • Architecture: XSA, used as part of the baseline model configuration (no parameters)
  • Value Residual: VE / value residual technique, used as part of the baseline model configuration (no parameters)
  • Test-Time Training: score-first TTT (no parameters)
  • Regularization: focal loss (gamma: 2.0)

Novel Contributions

  • Replaces standard cross-entropy with focal loss (gamma=2.0) during training, down-weighting easy tokens
  • Builds on the Approach B baseline with Int5 GPTQ, SWA, XSA, VE, and TTT
  • Shows that focal loss hurts validation BPB relative to the baseline
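The focal-loss substitution described above can be sketched as follows. This is an illustrative NumPy version, not the PR's actual implementation; the function name and shapes are assumptions. Focal loss scales per-token cross-entropy by (1 - p_t)^gamma, so confidently predicted (easy) tokens contribute less to the loss, and gamma=0 recovers plain cross-entropy:

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss over a batch of token positions.

    logits:  (N, V) unnormalized scores
    targets: (N,)   correct token indices
    With gamma=0 this reduces exactly to mean cross-entropy.
    """
    # Numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # p_t: probability assigned to the correct token at each position
    p_t = probs[np.arange(len(targets)), targets]
    # Per-token cross-entropy: -log p_t
    ce = -np.log(p_t)
    # Modulating factor (1 - p_t)^gamma down-weights easy tokens
    return np.mean((1.0 - p_t) ** gamma * ce)

# Illustrative usage on a toy batch of 2 positions, vocab size 3
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
targets = np.array([0, 2])
print(focal_loss(logits, targets, gamma=2.0))
```

Because (1 - p_t)^gamma < 1 whenever the model assigns any probability to the correct token, the focal loss is always below the corresponding cross-entropy for gamma > 0, which is what concentrates gradient signal on hard tokens.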