PR #966
open
Mixture of Convolutions (MoC): token-adaptive short convolutions via kernel mixtures
by andrewmouldon
val_bpb
1.2162
Architecture
Transformer
Optimizer
—
Artifact Size
16MB
Training Techniques
Architecture
short convolution
Replaces static short convolution with token-adaptive convolution kernels formed as mixtures over shared basis kernels.
parameters: {"k":8}
short convolution
MoC reduces to standard short convolution when the kernel bank has a single basis kernel.
parameters: {"k":1}
MLP expansion
Adjusted the MLP expansion ratio for each variant to keep every model within the parameter budget.
parameters: {"baseline":"2.00x","short_conv":"1.99x","moc":"1.93x"}
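The token-adaptive convolution described in the short convolution entries above can be sketched as follows. This is a minimal pure-Python illustration and not the PR's implementation. It assumes a causal depthwise convolution, a softmax router computed from the current token, and that `k` is the number of basis kernels in the bank. The names `mixture_of_convs`, `short_conv`, `bank`, and `router` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def short_conv(x, kernel):
    """Static causal depthwise convolution (assumed form of the baseline).

    x: T x D token features; kernel: W x D taps, where tap w reads the
    token w steps back. Positions before the sequence start are skipped.
    """
    width, dim = len(kernel), len(x[0])
    return [
        [sum(kernel[w][d] * x[t - w][d] for w in range(width) if t - w >= 0)
         for d in range(dim)]
        for t in range(len(x))
    ]

def mixture_of_convs(x, bank, router):
    """Token-adaptive variant: each token mixes the shared basis kernels.

    bank: k x W x D basis kernels shared by all tokens.
    router: k x D weights (hypothetical linear router, one logit per kernel).
    """
    width, dim = len(bank[0]), len(x[0])
    out = []
    for t, token in enumerate(x):
        # Routing: one mixture weight per basis kernel, from the current token.
        logits = [sum(r[d] * token[d] for d in range(dim)) for r in router]
        weights = softmax(logits)
        # This token's kernel is a convex combination of the basis kernels.
        kernel = [
            [sum(p * basis[w][d] for p, basis in zip(weights, bank))
             for d in range(dim)]
            for w in range(width)
        ]
        # Apply the mixed kernel causally at position t only.
        out.append([
            sum(kernel[w][d] * x[t - w][d] for w in range(width) if t - w >= 0)
            for d in range(dim)
        ])
    return out

# With a single basis kernel (k=1) the softmax weight is exactly 1.0,
# so the mixture collapses to the static short convolution.
x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
bank_k1 = [[[1.0, 0.5], [0.2, 0.1]]]  # k=1, W=2, D=2
router_k1 = [[0.3, -0.7]]
same_as_static = mixture_of_convs(x, bank_k1, router_k1) == short_conv(x, bank_k1[0])
```

Under these assumptions the router adds only `k x D` weights on top of the bank, so growing `k` increases the parameter count modestly. That is consistent with the small MLP expansion adjustments listed above, though the PR's actual parameterization may differ.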
Novel Contributions
- Mixture of Convolutions (MoC): token-adaptive short convolutions via kernel mixtures
- Per-token routing over a small shared bank of basis kernels
- Dynamic local operator that generalizes standard short convolution
- Improved BPB over baseline and static short convolution in fixed-step experiments
- Demonstration that projecting a kernel directly from each token performed poorly, while mixture-based kernels were stable