PR #459
Weight Entropy Regularization: Improved SWA Averaging (+0.028 BPB)
by mer2234
val_bpb: 1.1490
Architecture: —
Optimizer: —
Artifact Size: —
Training Techniques
Weight Averaging
SWA
parameters: {"start_step":5500,"warmdown_steps_remaining":1200}
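SWA here averages checkpoints collected after `start_step` during warmdown. A minimal sketch of equal-weight checkpoint averaging, with hypothetical names and parameters stored as plain lists of floats (real implementations average tensors):

```python
# Minimal sketch of SWA-style checkpoint averaging (names are illustrative).

def swa_average(checkpoints):
    """Equal-weight average of parameter dicts collected during warmdown."""
    n = len(checkpoints)
    return {
        name: [sum(c[name][i] for c in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }

# Two toy checkpoints of a single parameter "w":
print(swa_average([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))  # {'w': [2.0, 3.0]}
```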
Regularization
weight entropy regularization
parameters: {"lambda":0.002}
entropy token masking
parameters: null
Other
Entropy-regularized weights are intended to reduce variance across checkpoints so SWA averaging is more effective.
parameters: null
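The PR does not spell out the exact form of the penalty, so the following is only a plausible sketch: treat the normalized absolute weights as a probability distribution and add lambda times its entropy to the loss. The function name and formulation are assumptions, not the PR's code.

```python
import math

def weight_entropy_penalty(weights, lam=0.002):
    # Hypothetical form: normalize |w| into a distribution p, then
    # penalize its entropy H(p); the PR's exact formulation may differ.
    total = sum(abs(w) for w in weights)
    p = [abs(w) / total for w in weights]
    return lam * -sum(q * math.log(q) for q in p if q > 0)

# With weights [0.5, -0.5, 1.0], p = [0.25, 0.25, 0.5], so H(p) = 1.5 * ln(2):
print(weight_entropy_penalty([0.5, -0.5, 1.0]))  # ~0.00208
```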
Architecture
depth recurrence
Tested three recurrent depth-sharing variants as negative experiments.
parameters: {"layers":3,"loops":4}
parameters: {"layers":4,"loops":3}
parameters: {"layers":2,"loops":6}
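Depth recurrence in this sense re-applies a small stack of shared layers several times, so `layers * loops` gives the effective depth. A toy sketch with illustrative names:

```python
def recurrent_depth_forward(x, shared_layers, loops):
    # Apply the same shared stack `loops` times: effective depth is
    # len(shared_layers) * loops with only len(shared_layers) unique blocks.
    for _ in range(loops):
        for layer in shared_layers:
            x = layer(x)
    return x

# 3 shared "layers" looped 4 times = 12 applications, e.g. doubling 1 twelve times:
double = lambda v: v * 2
print(recurrent_depth_forward(1, [double] * 3, 4))  # 4096
```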
Kronecker attention
Kronecker Q/K attention variant tested as a negative experiment.
parameters: null
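The PR gives no parameters for this variant. The usual idea behind a Kronecker-factored projection is to replace a large Q/K weight matrix with the Kronecker product of two small factors, storing only the factors; a pure-Python sketch of the product itself (helper name is an assumption):

```python
def kron(A, B):
    # Kronecker product of matrices given as lists of lists:
    # for A (m x n) and B (p x q), the result is (m*p) x (n*q).
    # Storing A and B instead of kron(A, B) is what cuts parameters.
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# A 2x2 factor combined with a 2x2 swap matrix gives a 4x4 projection:
print(kron([[1, 2], [3, 4]], [[0, 1], [1, 0]]))
# [[0, 1, 0, 2], [1, 0, 2, 0], [0, 3, 0, 4], [3, 0, 4, 0]]
```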
skip-gram hash
Hash-based skip-gram feature variant tested as a negative experiment.
parameters: null
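For the skip-gram hash variant, one common construction hashes (token, later token, gap) triples into a fixed bucket space to get cheap n-gram-like features. A sketch with hypothetical names and bucket count:

```python
def skipgram_hash_features(token_ids, num_buckets=4096, max_gap=2):
    # Hypothetical feature builder: hash each (token, later token, gap)
    # triple into one of num_buckets feature ids.
    feats = []
    for i, a in enumerate(token_ids):
        for gap in range(1, max_gap + 1):
            if i + gap < len(token_ids):
                feats.append(hash((a, token_ids[i + gap], gap)) % num_buckets)
    return feats

# 4 tokens with gaps 1 and 2 yield 3 + 2 = 5 features:
print(len(skipgram_hash_features([10, 11, 12, 13])))  # 5
```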
Novel Contributions
- Weight entropy regularization that adds an entropy penalty to weights during training
- Improved SWA averaging by making checkpoints more consistent across training
- Reported +0.028 BPB improvement at step 8500 relative to baseline
- Demonstrated no effect during normal training but a benefit during the SWA warmdown
- Documented 15 negative-result experiments across multiple alternative techniques