PR #1551
open · [notable non record] · Train-Time Overparameterization: Better Models Through Transient Expansion
by andrewmouldon
val_bpb
1.2199
Architecture
Transformer
Optimizer
—
Artifact Size
15.9MB
Training Techniques
Architecture
MLP expansion
Temporarily expands the MLP during training and later consolidates back to the original width before inference.
parameters: {"baseline_expand":"2x","temporary_expand":"4x","effective_training_width":"8x"}
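The expand-then-consolidate idea can be illustrated with a minimal numpy sketch (all names and shapes here are hypothetical, not the PR's actual code). It shows the key invariant: if the dropped neurons are already gated off in the wide MLP, slicing out the surviving rows of the input projection and the matching columns of the output projection yields a smaller MLP that computes exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W_in, b_in, W_out, b_out):
    h = np.maximum(W_in @ x + b_in, 0.0)  # ReLU hidden layer
    return W_out @ h + b_out

d, hidden, keep = 8, 64, 16  # wide training MLP, consolidated to 16 neurons
W_in = rng.normal(size=(hidden, d)); b_in = rng.normal(size=hidden)
W_out = rng.normal(size=(d, hidden)); b_out = rng.normal(size=d)

# Pretend gating selected these neurons to survive consolidation
keep_idx = np.sort(rng.choice(hidden, size=keep, replace=False))

# Consolidation: slice the surviving rows/columns out of the wide weights
W_in_s, b_in_s, W_out_s = W_in[keep_idx], b_in[keep_idx], W_out[:, keep_idx]

# Equivalence check: with the dropped neurons masked off in the wide MLP,
# the consolidated small MLP computes the same output
mask = np.zeros(hidden); mask[keep_idx] = 1.0
x = rng.normal(size=d)
wide = W_out @ (mask * np.maximum(W_in @ x + b_in, 0.0)) + b_out
small = mlp(x, W_in_s, b_in_s, W_out_s, b_out)
assert np.allclose(wide, small)
```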
Other
other
Train-time overparameterization with staged expansion, stochastic gating, and later pruning/consolidation to the target model size.
parameters: {"stages":["0-5% full expansion","5-25% stochastic gating","25-100% pruned target-width training"]}
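The three stages listed above can be read as a simple dispatch on training progress; a sketch (stage names are illustrative, thresholds taken from the listed schedule):

```python
def training_stage(frac: float) -> str:
    """Map a training-progress fraction in [0, 1] to the TTO stage."""
    if frac < 0.05:
        return "full_expansion"       # 0-5%: all expanded neurons active
    if frac < 0.25:
        return "stochastic_gating"    # 5-25%: Bernoulli masks select survivors
    return "pruned_target_width"      # 25-100%: consolidated model trains on
```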
other
Discrete stochastic neuron gating using Bernoulli masks with a straight-through gradient estimator.
parameters: {"gating":"Bernoulli","estimator":"straight-through"}
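A minimal sketch of the Bernoulli gate with a straight-through estimator, written as explicit forward/backward functions in numpy (function names are hypothetical; an autograd framework would fold the backward rule into one expression). The forward pass samples a hard 0/1 mask; the backward pass pretends the soft probability was used, so gradients still reach the gate logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def gate_forward(logits):
    p = 1.0 / (1.0 + np.exp(-logits))                  # Bernoulli keep-probabilities
    mask = (rng.random(p.shape) < p).astype(p.dtype)   # hard 0/1 sample
    return mask, p

def gate_backward(grad_out, p):
    # Straight-through: treat the hard sample as if it were sigmoid(logits),
    # so the gradient w.r.t. logits is grad_out * p * (1 - p)
    return grad_out * p * (1.0 - p)
```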
Regularization
structured pruning
parameters: {"top_k":true,"gradual_budget_annealing":true}
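Top-k structured pruning under a neuron budget can be sketched as follows (hypothetical helper; the PR's importance scores and budget source are not specified here):

```python
import numpy as np

def topk_neuron_mask(scores: np.ndarray, budget_frac: float) -> np.ndarray:
    """Keep the top-k hidden neurons by importance score under the budget."""
    k = max(1, int(round(budget_frac * scores.size)))
    keep = np.argsort(scores)[-k:]     # indices of the k highest-scoring neurons
    mask = np.zeros_like(scores)
    mask[keep] = 1.0
    return mask
```

With gradual budget annealing, `budget_frac` shrinks over training, so the surviving set contracts toward the target width rather than being cut in one step.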
LR Schedule
budget annealing
parameters: {"start":"5%","end":"15%","consolidate_at":"25%"}
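One plausible reading of this schedule, sketched below: the neuron budget stays full until 5% of training, anneals linearly between 5% and 15%, then holds the target fraction until consolidation at 25%. The target of 0.25 is an assumption here (baseline 2x expansion out of the 8x effective training width); the PR may anneal differently.

```python
def annealed_budget(frac: float, start: float = 0.05, end: float = 0.15,
                    target_budget: float = 0.25) -> float:
    """Fraction of expanded neurons kept active at training fraction `frac`."""
    if frac < start:
        return 1.0                                   # full expansion, no pruning
    if frac < end:
        t = (frac - start) / (end - start)
        return 1.0 + t * (target_budget - 1.0)       # linear anneal toward target
    return target_budget                             # hold until consolidation
```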
Novel Contributions
- Train-Time Overparameterization (TTO): temporarily expand MLP capacity during training and consolidate back to the original size for inference.
- Use stochastic discrete gating to identify which neurons survive consolidation.
- Introduce a staged training schedule with early full expansion, controlled sparsification, and final pruning.
- Show consistent BPB improvements over a strong baseline across three seeds at the same final model size.
- Provide evidence that the gain is not explained solely by transferring the selected neuron subset, suggesting an optimization-driven benefit.