PR #966

open

Mixture of Convolutions (MoC): token-adaptive short convolutions via kernel mixtures

by andrewmouldon
val_bpb: 1.2162
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 16MB

Training Techniques

Architecture
short convolution
Replaces static short convolution with token-adaptive convolution kernels formed as mixtures over shared basis kernels.
parameters: {"k":8}
short convolution
MoC reduces to standard short convolution when the kernel bank has a single basis kernel.
parameters: {"k":1}
MLP expansion
Adjusted MLP expansion to keep models within the parameter budget.
parameters: {"baseline":"2.00x","short_conv":"1.99x","moc":"1.93x"}
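The trade-off behind the adjusted MLP expansion can be illustrated with simple parameter accounting. The sketch below is not taken from the PR: it assumes a two-matrix MLP without gating or biases, a depthwise kernel bank, and a linear router, and the names `moc_extra_params` and `mlp_expansion_for_budget` as well as the example dimensions are hypothetical. It is not expected to reproduce the 1.99x and 1.93x ratios listed above, which depend on the actual model configuration.

```python
def moc_extra_params(d_model, width, k):
    """Hypothetical per-layer cost of a MoC layer.

    Assumes k depthwise basis kernels of the given width (k * d_model * width
    weights) plus a linear router from d_model features to k mixture logits.
    """
    return k * d_model * width + d_model * k


def mlp_expansion_for_budget(d_model, base_expansion, extra_params_per_layer):
    """Expansion ratio that keeps the per-layer parameter count unchanged.

    Assumes the MLP is an up projection and a down projection with no gating
    and no biases, i.e. 2 * d_model * (expansion * d_model) weights.
    """
    per_unit_expansion = 2 * d_model * d_model
    mlp_budget = per_unit_expansion * base_expansion
    return (mlp_budget - extra_params_per_layer) / per_unit_expansion


# Illustrative values only; the PR does not state d_model or the kernel width.
extra = moc_extra_params(d_model=512, width=4, k=8)
adjusted = mlp_expansion_for_budget(512, 2.0, extra)
print(f"extra params per layer: {extra}, adjusted expansion: {adjusted:.3f}x")
```

Under these assumptions, any parameters added by the convolution layer are paid for by a proportionally smaller MLP hidden width, which is why the ratio shrinks as the kernel bank grows.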

Novel Contributions

  • Mixture of Convolutions (MoC): token-adaptive short convolutions via kernel mixtures
  • Per-token routing over a small shared bank of basis kernels
  • Dynamic local operator that generalizes standard short convolution
  • Lower BPB (val_bpb) than both the baseline and static short convolution in fixed-step experiments
  • Demonstration that direct per-token kernel projection performed poorly while mixture-based kernels were stable
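
The contributions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the PR's implementation: it assumes a causal depthwise convolution, a linear router followed by a softmax, and the hypothetical names `moc_short_conv`, `static_short_conv`, `bank`, and `router`. The PR text only states that kernels are per-token mixtures over a shared bank of basis kernels.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def static_short_conv(x, kernel):
    """Causal depthwise short convolution with one fixed kernel.

    x: T x D token features, kernel: D x W taps (tap 0 is the current token).
    """
    T, D, W = len(x), len(x[0]), len(kernel[0])
    return [[sum(kernel[d][w] * x[t - w][d] for w in range(W) if t - w >= 0)
             for d in range(D)] for t in range(T)]


def moc_short_conv(x, bank, router):
    """Short convolution whose kernel is a per-token mixture of basis kernels.

    x: T x D token features
    bank: K x D x W shared basis kernels
    router: D x K weights of an assumed linear router
    """
    T, D = len(x), len(x[0])
    K, W = len(bank), len(bank[0][0])
    out = []
    for t in range(T):
        # Route: token t picks mixture weights over the K basis kernels.
        logits = [sum(x[t][d] * router[d][k] for d in range(D)) for k in range(K)]
        mix = softmax(logits)
        row = []
        for d in range(D):
            acc = 0.0
            for w in range(min(W, t + 1)):  # causal: current and earlier tokens
                tap = sum(mix[k] * bank[k][d][w] for k in range(K))
                acc += tap * x[t - w][d]
            row.append(acc)
        out.append(row)
    return out
```

With a single basis kernel the softmax returns a weight of 1 for every token, so `moc_short_conv(x, [kernel], router)` matches `static_short_conv(x, kernel)`, which is the k=1 reduction described in the PR. A direct per-token kernel projection would instead predict all D x W taps from each token. The mixture restricts every token's kernel to a convex combination of the K shared kernels under the softmax assumption used here.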