PR #459

Status: open

Weight Entropy Regularization: Improved SWA Averaging (+0.028 BPB)

by mer2234
val_bpb: 1.1490

Training Techniques

Weight Averaging
  • SWA (parameters: {"start_step":5500,"warmdown_steps_remaining":1200})

Regularization
  • weight entropy regularization (parameters: {"lambda":0.002})
  • entropy token masking (parameters: null)

Other
  • other (parameters: null): Entropy-regularized weights are intended to reduce variance across checkpoints so that SWA averaging is more effective.
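The PR does not publish the exact form of the penalty, but the described technique can be sketched as below. The per-tensor normalization of |w| into a distribution and the use of Shannon entropy are assumptions; only lambda=0.002 is taken from the listed parameters.

```python
import torch


def weight_entropy_penalty(model, lam=0.002):
    """Hypothetical sketch of a weight entropy penalty.

    Treats each parameter tensor's normalized absolute values as a
    probability distribution and sums the Shannon entropy of those
    distributions, scaled by lambda. The exact formulation in the PR
    is not published; this is one plausible instantiation.
    """
    penalty = torch.zeros(())
    for p in model.parameters():
        if p.dim() < 2:  # skip biases / norm gains (an assumption)
            continue
        probs = p.abs().flatten()
        probs = probs / (probs.sum() + 1e-12)
        entropy = -(probs * (probs + 1e-12).log()).sum()
        penalty = penalty + entropy
    return lam * penalty
```

In training this would simply be added to the task loss, e.g. `loss = task_loss + weight_entropy_penalty(model)`, so gradients flow into the weights through the penalty term.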
Architecture
Recurrent depth-sharing variants, tested as negative experiments:
  • depth recurrence (parameters: {"layers":3,"loops":4})
  • depth recurrence (parameters: {"layers":4,"loops":3})
  • depth recurrence (parameters: {"layers":2,"loops":6})
Other negative architecture experiments:
  • Kronecker attention (parameters: null): Kronecker Q/K attention variant.
  • skip-gram hash (parameters: null): hash-based skip-gram feature variant.
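The depth-recurrence configs above (layers x loops) can be sketched as a short stack of blocks whose weights are reused several times, giving layers*loops effective depth with only layers' worth of parameters. The block type, model width, and head count below are assumptions for illustration; the PR reports these variants as negative results.

```python
import torch
import torch.nn as nn


class DepthRecurrentBlock(nn.Module):
    """Hypothetical sketch of depth recurrence: `layers` shared
    transformer blocks applied `loops` times in sequence, matching
    the {"layers":3,"loops":4}-style configs tested in the PR."""

    def __init__(self, d_model=64, layers=3, loops=4, nhead=4):
        super().__init__()
        self.loops = loops
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(layers)
        )

    def forward(self, x):
        # Reuse the same block weights on every loop iteration.
        for _ in range(self.loops):
            for block in self.blocks:
                x = block(x)
        return x
```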

Novel Contributions

  • Weight entropy regularization that adds an entropy penalty on the weights to the training loss
  • Improved SWA averaging by making checkpoints more consistent across training
  • Reported +0.028 BPB improvement at step 8500 relative to baseline
  • Showed no measurable effect during normal training but a benefit during the SWA warmdown phase
  • Documented 15 negative-result experiments across multiple alternative techniques
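For context, the SWA step the regularizer is meant to help can be sketched as a plain arithmetic mean of parameters across checkpoints collected during warmdown. The averaging function below is a generic SWA sketch, not the PR's actual implementation.

```python
import torch


def swa_average(state_dicts):
    """Minimal sketch of stochastic weight averaging: average each
    parameter elementwise across a list of checkpoint state dicts
    (e.g. those saved after start_step=5500 in the PR's config)."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    n = len(state_dicts)
    return {k: v / n for k, v in avg.items()}
```

The PR's claim is that entropy-regularized weights vary less between these checkpoints, so their mean lands closer to each individual solution and averaging degrades the model less.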