PR #1360
Non-record: Gaussian per-token loss reweighting — what goes wrong and why (+0.014 bpb)
by JulianTang2027
val_bpb
1.1585
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,050,780 bytes
Training Techniques
Architecture
BigramHash
Uses a BigramHash(10240) component in the base model.
parameters: {"size":10240}
Weight Averaging
SWA
parameters: {"decay":0.4}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Gaussian per-token loss reweighting centered on batch mean per-token loss, with z-score based weights and sigma-controlled downweighting of easy and hard tokens.
parameters: {"sigma":2,"normalized_weights":true}
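The reweighting described above can be sketched as follows. This is a minimal reconstruction from the description, not the PR's actual code: the function name, the epsilon, and the choice to renormalize weights to mean 1 are all assumptions.

```python
import numpy as np

def gaussian_reweight(token_losses, sigma=2.0, normalized_weights=True, eps=1e-8):
    """Sketch: Gaussian weights on batch-relative per-token loss z-scores.

    Tokens near the batch-mean loss get weight ~1; easy (low-loss) and
    hard (high-loss) tokens are downweighted, with sigma controlling how
    quickly the weight falls off in the tails.
    """
    losses = np.asarray(token_losses, dtype=np.float64)
    z = (losses - losses.mean()) / (losses.std() + eps)  # batch-relative z-score
    weights = np.exp(-0.5 * (z / sigma) ** 2)            # Gaussian bump centered at z=0
    if normalized_weights:
        # Assumed interpretation of normalized_weights: rescale so the
        # mean weight is 1, keeping the loss scale comparable.
        weights = weights * len(weights) / weights.sum()
    return float((weights * losses).mean()), weights
```

With sigma=2 the downweighting is mild: a token two standard deviations from the batch mean still keeps about 60% of its weight before renormalization.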
Evaluation
sliding window eval
parameters: {"stride":64}
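For reference, stride-64 sliding-window evaluation advances the scoring window by the stride and counts only the newly covered tokens toward the total, so every token is scored with substantial left context. A minimal sketch, assuming a `score_window` callable returning per-token NLL (name and signature hypothetical; the window size here is illustrative, not taken from the PR):

```python
import numpy as np

def sliding_window_nll(score_window, tokens, window=256, stride=64):
    """Sketch of sliding-window evaluation.

    Windows advance by `stride`; only the last `stride` positions of each
    window contribute to the total, so each scored token gets up to
    `window - stride` tokens of left context.
    """
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        # score_window returns per-token NLL for tokens[ctx_start:end]
        per_tok = score_window(tokens[ctx_start:end])
        total += float(np.sum(per_tok[-(end - start):]))  # new tokens only
        count += end - start
    return total / count  # mean nats/token; divide by ln(2)*bytes/token for bpb
```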
Novel Contributions
- Gaussian per-token loss reweighting as a training-time loss shaping experiment
- Negative-result analysis showing improved weighted val_loss can hide worse competition metric performance
- Diagnosis that batch-relative loss reweighting optimizes a different objective than unweighted sliding-window BPB
- Comparison against a PR #180 baseline with identical seed and hardware
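The second bullet, that an improved weighted val_loss can hide worse performance on the unweighted metric, can be illustrated with a toy example. The loss vectors below are invented for illustration and do not come from the PR:

```python
import numpy as np

def weighted_mean_loss(losses, sigma=2.0, eps=1e-8):
    # Gaussian z-score weights renormalized to mean 1, as sketched above.
    losses = np.asarray(losses, dtype=np.float64)
    z = (losses - losses.mean()) / (losses.std() + eps)
    w = np.exp(-0.5 * (z / sigma) ** 2)
    w *= len(w) / w.sum()
    return float((w * losses).mean())

# Hypothetical per-token losses for two checkpoints:
spiky = [1.0, 1.0, 1.0, 10.0]  # mostly easy tokens, one hard outlier
flat  = [3.0, 3.0, 3.0, 3.0]   # uniform difficulty

# The unweighted mean (what BPB tracks) prefers `flat`...
assert np.mean(spiky) > np.mean(flat)
# ...but the Gaussian-weighted objective prefers `spiky`, because the
# hard outlier is downweighted. The two objectives rank models differently.
assert weighted_mean_loss(spiky) < weighted_mean_loss(flat)
```

This is the mechanism behind the diagnosis: batch-relative reweighting discounts exactly the tokens that dominate the unweighted sliding-window BPB.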