| val_bpb | Architecture | Optimizer | Artifact Size |
|---|---|---|---|
| 1.1460 | Transformer | — | — |
Training Techniques

- Quantization: GPTQ (bits: 5, scope: model weights)
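For reference, "bits: 5" means the weights are stored as 5-bit integers. GPTQ's Hessian-based error correction is beyond a few lines, but a hedged sketch of the plain round-to-nearest int5 quantization that GPTQ improves on looks like this (all names and shapes are illustrative, not from the experiment):

```python
import torch

def quantize_int5(w: torch.Tensor):
    """Round-to-nearest 5-bit symmetric quantization, per output channel.

    5 bits give integer levels in [-16, 15]; each row of the weight
    matrix gets its own scale. This is only the rounding step: GPTQ
    additionally corrects rounding error column by column using
    second-order statistics of the layer inputs.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 15.0  # map max |w| to level 15
    q = torch.clamp(torch.round(w / scale), -16, 15)  # integer codes
    return q, scale

w = torch.randn(8, 16)   # toy weight matrix
q, scale = quantize_int5(w)
w_hat = q * scale        # dequantized weights used at inference
```

Per-channel scaling keeps the rounding error of each row below half a quantization step, which is why the reconstruction `w_hat` stays close to `w`.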
- Weight Averaging: SWA (parameters: null)
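Since the card lists no SWA parameters, here is a minimal sketch of stochastic weight averaging using PyTorch's built-in utilities; the stand-in model, averaging start step, and learning rate are illustrative assumptions:

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

# Stand-in model and schedule; the actual Transformer and SWA
# hyperparameters are not specified in the card.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)              # running average of weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for step in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()
    loss.backward()
    optimizer.step()
    if step >= 75:                            # average only late-training weights
        swa_model.update_parameters(model)
        swa_scheduler.step()
```

At evaluation time, `swa_model` (the averaged copy) is used instead of the last iterate.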
- Architecture: XSA, used as part of the baseline model configuration (parameters: null)
- Value Residual: VE (value residual), used as part of the baseline model configuration (parameters: null)
- Test-Time Training: score-first TTT (parameters: null)
- Regularization: focal loss (parameters: {"gamma": 2})
Novel Contributions
- Replaces standard cross-entropy with focal loss (gamma = 2.0), down-weighting easy tokens during training
- Builds on the Approach B baseline with Int5 GPTQ, SWA, XSA, VE, and TTT
- Shows that focal loss hurts validation BPB relative to the baseline
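The focal-loss swap described above can be sketched as follows; this is a minimal PyTorch sketch of the loss function alone (the vocabulary size and batch shape are illustrative, and the surrounding training loop is not shown):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss over next-token logits.

    Standard cross-entropy is -log p_t for the true token; focal loss
    scales each token's loss by (1 - p_t)^gamma, so confident (easy)
    tokens contribute less and hard tokens dominate the gradient.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    return ((1.0 - pt).pow(gamma) * -log_pt).mean()

logits = torch.randn(8, 50257)           # (tokens, vocab); sizes are illustrative
targets = torch.randint(0, 50257, (8,))
loss = focal_loss(logits, targets, gamma=2.0)
```

With gamma = 0 the scaling factor is 1 and the loss reduces exactly to cross-entropy; gamma = 2 is the value the card reports.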