PR #459

Status: open

Weight Entropy Regularization: Improved SWA Averaging (+0.028 BPB)

by mer2234
val_bpb: 1.1490

Training Techniques

Weight Averaging
  • SWA (parameters: {"start_step":5500,"warmdown_steps_remaining":1200})

Regularization
  • weight entropy regularization (parameters: {"lambda":0.002})
  • entropy token masking (parameters: null)

Other
  • other (parameters: null): Entropy-regularized weights are intended to reduce variance across checkpoints so that SWA averaging is more effective.
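The PR does not publish the exact form of the penalty, but the described technique can be sketched as below. The per-tensor normalization of |w| into a distribution and the use of Shannon entropy are assumptions; only lambda=0.002 is taken from the listed parameters.

```python
import torch


def weight_entropy_penalty(model, lam=0.002):
    """Hypothetical sketch of a weight entropy penalty.

    Treats each parameter tensor's normalized absolute values as a
    probability distribution and sums the Shannon entropy of those
    distributions, scaled by lambda. The exact formulation in the PR
    is not published; this is one plausible instantiation.
    """
    penalty = torch.zeros(())
    for p in model.parameters():
        if p.dim() < 2:  # skip biases / norm gains (an assumption)
            continue
        probs = p.abs().flatten()
        probs = probs / (probs.sum() + 1e-12)
        entropy = -(probs * (probs + 1e-12).log()).sum()
        penalty = penalty + entropy
    return lam * penalty
```

In training this would simply be added to the task loss, e.g. `loss = task_loss + weight_entropy_penalty(model)`, so gradients flow into the weights through the penalty term.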
Architecture
Recurrent depth-sharing variants, tested as negative experiments:
  • depth recurrence (parameters: {"layers":3,"loops":4})
  • depth recurrence (parameters: {"layers":4,"loops":3})
  • depth recurrence (parameters: {"layers":2,"loops":6})
Other negative architecture experiments:
  • Kronecker attention (parameters: null): Kronecker Q/K attention variant.
  • skip-gram hash (parameters: null): hash-based skip-gram feature variant.
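The depth-recurrence configs above (layers x loops) can be sketched as a short stack of blocks whose weights are reused several times, giving layers*loops effective depth with only layers' worth of parameters. The block type, model width, and head count below are assumptions for illustration; the PR reports these variants as negative results.

```python
import torch
import torch.nn as nn


class DepthRecurrentBlock(nn.Module):
    """Hypothetical sketch of depth recurrence: `layers` shared
    transformer blocks applied `loops` times in sequence, matching
    the {"layers":3,"loops":4}-style configs tested in the PR."""

    def __init__(self, d_model=64, layers=3, loops=4, nhead=4):
        super().__init__()
        self.loops = loops
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(layers)
        )

    def forward(self, x):
        # Reuse the same block weights on every loop iteration.
        for _ in range(self.loops):
            for block in self.blocks:
                x = block(x)
        return x
```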

Novel Contributions

  • Weight entropy regularization that adds an entropy penalty on the weights to the training loss
  • Improved SWA averaging by making checkpoints more consistent across training
  • Reported +0.028 BPB improvement at step 8500 relative to baseline
  • Showed no measurable effect during normal training but a benefit during the SWA warmdown phase
  • Documented 15 negative-result experiments across multiple alternative techniques
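For context, the SWA step the regularizer is meant to help can be sketched as a plain arithmetic mean of parameters across checkpoints collected during warmdown. The averaging function below is a generic SWA sketch, not the PR's actual implementation.

```python
import torch


def swa_average(state_dicts):
    """Minimal sketch of stochastic weight averaging: average each
    parameter elementwise across a list of checkpoint state dicts
    (e.g. those saved after start_step=5500 in the PR's config)."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    n = len(state_dicts)
    return {k: v / n for k, v in avg.items()}
```

The PR's claim is that entropy-regularized weights vary less between these checkpoints, so their mean lands closer to each individual solution and averaging degrades the model less.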