PR #1551
open · [notable non record] · Train-Time Overparameterization: Better Models Through Transient Expansion
by andrewmouldon
val_bpb
1.2199
Architecture
Transformer
Optimizer
—
Artifact Size
15.9MB
Training Techniques
Architecture
MLP expansion
Temporarily expands the MLP during training and later consolidates back to the original width before inference.
parameters: {"baseline_expand":"2x","temporary_expand":"4x","effective_training_width":"8x"}
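The expand-then-consolidate idea can be illustrated with a minimal numpy sketch (all names and shapes here are hypothetical, not the PR's actual code). It shows the key invariant: if the dropped neurons are already gated off in the wide MLP, slicing out the surviving rows of the input projection and the matching columns of the output projection yields a smaller MLP that computes exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W_in, b_in, W_out, b_out):
    h = np.maximum(W_in @ x + b_in, 0.0)  # ReLU hidden layer
    return W_out @ h + b_out

d, hidden, keep = 8, 64, 16  # wide training MLP, consolidated to 16 neurons
W_in = rng.normal(size=(hidden, d)); b_in = rng.normal(size=hidden)
W_out = rng.normal(size=(d, hidden)); b_out = rng.normal(size=d)

# Pretend gating selected these neurons to survive consolidation
keep_idx = np.sort(rng.choice(hidden, size=keep, replace=False))

# Consolidation: slice the surviving rows/columns out of the wide weights
W_in_s, b_in_s, W_out_s = W_in[keep_idx], b_in[keep_idx], W_out[:, keep_idx]

# Equivalence check: with the dropped neurons masked off in the wide MLP,
# the consolidated small MLP computes the same output
mask = np.zeros(hidden); mask[keep_idx] = 1.0
x = rng.normal(size=d)
wide = W_out @ (mask * np.maximum(W_in @ x + b_in, 0.0)) + b_out
small = mlp(x, W_in_s, b_in_s, W_out_s, b_out)
assert np.allclose(wide, small)
```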
Other
other
Train-time overparameterization with staged expansion, stochastic gating, and later pruning/consolidation to the target model size.
parameters: {"stages":["0-5% full expansion","5-25% stochastic gating","25-100% pruned target-width training"]}
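The three stages listed above can be read as a simple dispatch on training progress; a sketch (stage names are illustrative, thresholds taken from the listed schedule):

```python
def training_stage(frac: float) -> str:
    """Map a training-progress fraction in [0, 1] to the TTO stage."""
    if frac < 0.05:
        return "full_expansion"       # 0-5%: all expanded neurons active
    if frac < 0.25:
        return "stochastic_gating"    # 5-25%: Bernoulli masks select survivors
    return "pruned_target_width"      # 25-100%: consolidated model trains on
```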
other
Discrete stochastic neuron gating using Bernoulli masks with a straight-through gradient estimator.
parameters: {"gating":"Bernoulli","estimator":"straight-through"}
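A minimal sketch of the Bernoulli gate with a straight-through estimator, written as explicit forward/backward functions in numpy (function names are hypothetical; an autograd framework would fold the backward rule into one expression). The forward pass samples a hard 0/1 mask; the backward pass pretends the soft probability was used, so gradients still reach the gate logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def gate_forward(logits):
    p = 1.0 / (1.0 + np.exp(-logits))                  # Bernoulli keep-probabilities
    mask = (rng.random(p.shape) < p).astype(p.dtype)   # hard 0/1 sample
    return mask, p

def gate_backward(grad_out, p):
    # Straight-through: treat the hard sample as if it were sigmoid(logits),
    # so the gradient w.r.t. logits is grad_out * p * (1 - p)
    return grad_out * p * (1.0 - p)
```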
Regularization
structured pruning
parameters: {"top_k":true,"gradual_budget_annealing":true}
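Top-k structured pruning under a neuron budget can be sketched as follows (hypothetical helper; the PR's importance scores and budget source are not specified here):

```python
import numpy as np

def topk_neuron_mask(scores: np.ndarray, budget_frac: float) -> np.ndarray:
    """Keep the top-k hidden neurons by importance score under the budget."""
    k = max(1, int(round(budget_frac * scores.size)))
    keep = np.argsort(scores)[-k:]     # indices of the k highest-scoring neurons
    mask = np.zeros_like(scores)
    mask[keep] = 1.0
    return mask
```

With gradual budget annealing, `budget_frac` shrinks over training, so the surviving set contracts toward the target width rather than being cut in one step.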
LR Schedule
budget annealing
parameters: {"start":"5%","end":"15%","consolidate_at":"25%"}
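One plausible reading of this schedule, sketched below: the neuron budget stays full until 5% of training, anneals linearly between 5% and 15%, then holds the target fraction until consolidation at 25%. The target of 0.25 is an assumption here (baseline 2x expansion out of the 8x effective training width); the PR may anneal differently.

```python
def annealed_budget(frac: float, start: float = 0.05, end: float = 0.15,
                    target_budget: float = 0.25) -> float:
    """Fraction of expanded neurons kept active at training fraction `frac`."""
    if frac < start:
        return 1.0                                   # full expansion, no pruning
    if frac < end:
        t = (frac - start) / (end - start)
        return 1.0 + t * (target_budget - 1.0)       # linear anneal toward target
    return target_budget                             # hold until consolidation
```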
Novel Contributions
- Train-Time Overparameterization (TTO): temporarily expand MLP capacity during training and consolidate back to the original size for inference.
- Use stochastic discrete gating to identify which neurons survive consolidation.
- Introduce a staged training schedule with early full expansion, controlled sparsification, and final pruning.
- Show consistent BPB improvements over a strong baseline across three seeds at the same final model size.
- Provide evidence that the gain is not explained solely by transferring the selected neuron subset, suggesting an optimization-driven benefit.