PR #1360
Non-record: Gaussian per-token loss reweighting — what goes wrong and why (+0.014 bpb)
by JulianTang2027
val_bpb
1.1585
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,050,780 bytes
Training Techniques
Architecture
BigramHash
Uses a BigramHash(10240) component in the base model.
parameters: {"size":10240}
Weight Averaging
SWA
parameters: {"decay":0.4}
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Gaussian per-token loss reweighting centered on batch mean per-token loss, with z-score based weights and sigma-controlled downweighting of easy and hard tokens.
parameters: {"sigma":2,"normalized_weights":true}
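The reweighting described above can be sketched as follows. This is a minimal reconstruction from the description, not the PR's actual code: the function name, the epsilon, and the choice to renormalize weights to mean 1 are all assumptions.

```python
import numpy as np

def gaussian_reweight(token_losses, sigma=2.0, normalized_weights=True, eps=1e-8):
    """Sketch: Gaussian weights on batch-relative per-token loss z-scores.

    Tokens near the batch-mean loss get weight ~1; easy (low-loss) and
    hard (high-loss) tokens are downweighted, with sigma controlling how
    quickly the weight falls off in the tails.
    """
    losses = np.asarray(token_losses, dtype=np.float64)
    z = (losses - losses.mean()) / (losses.std() + eps)  # batch-relative z-score
    weights = np.exp(-0.5 * (z / sigma) ** 2)            # Gaussian bump centered at z=0
    if normalized_weights:
        # Assumed interpretation of normalized_weights: rescale so the
        # mean weight is 1, keeping the loss scale comparable.
        weights = weights * len(weights) / weights.sum()
    return float((weights * losses).mean()), weights
```

With sigma=2 the downweighting is mild: a token two standard deviations from the batch mean still keeps about 60% of its weight before renormalization.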
Evaluation
sliding window eval
parameters: {"stride":64}
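For reference, stride-64 sliding-window evaluation advances the scoring window by the stride and counts only the newly covered tokens toward the total, so every token is scored with substantial left context. A minimal sketch, assuming a `score_window` callable returning per-token NLL (name and signature hypothetical; the window size here is illustrative, not taken from the PR):

```python
import numpy as np

def sliding_window_nll(score_window, tokens, window=256, stride=64):
    """Sketch of sliding-window evaluation.

    Windows advance by `stride`; only the last `stride` positions of each
    window contribute to the total, so each scored token gets up to
    `window - stride` tokens of left context.
    """
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        # score_window returns per-token NLL for tokens[ctx_start:end]
        per_tok = score_window(tokens[ctx_start:end])
        total += float(np.sum(per_tok[-(end - start):]))  # new tokens only
        count += end - start
    return total / count  # mean nats/token; divide by ln(2)*bytes/token for bpb
```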
Novel Contributions
- Gaussian per-token loss reweighting as a training-time loss shaping experiment
- Negative-result analysis showing improved weighted val_loss can hide worse competition metric performance
- Diagnosis that batch-relative loss reweighting optimizes a different objective than unweighted sliding-window BPB
- Comparison against a PR #180 baseline with identical seed and hardware
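The second bullet, that an improved weighted val_loss can hide worse performance on the unweighted metric, can be illustrated with a toy example. The loss vectors below are invented for illustration and do not come from the PR:

```python
import numpy as np

def weighted_mean_loss(losses, sigma=2.0, eps=1e-8):
    # Gaussian z-score weights renormalized to mean 1, as sketched above.
    losses = np.asarray(losses, dtype=np.float64)
    z = (losses - losses.mean()) / (losses.std() + eps)
    w = np.exp(-0.5 * (z / sigma) ** 2)
    w *= len(w) / w.sum()
    return float((w * losses).mean())

# Hypothetical per-token losses for two checkpoints:
spiky = [1.0, 1.0, 1.0, 10.0]  # mostly easy tokens, one hard outlier
flat  = [3.0, 3.0, 3.0, 3.0]   # uniform difficulty

# The unweighted mean (what BPB tracks) prefers `flat`...
assert np.mean(spiky) > np.mean(flat)
# ...but the Gaussian-weighted objective prefers `spiky`, because the
# hard outlier is downweighted. The two objectives rank models differently.
assert weighted_mean_loss(spiky) < weighted_mean_loss(flat)
```

This is the mechanism behind the diagnosis: batch-relative reweighting discounts exactly the tokens that dominate the unweighted sliding-window BPB.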