PR #966

open

Mixture of Convolutions (MoC): token-adaptive short convolutions via kernel mixtures

by andrewmouldon

val_bpb

1.2162

Architecture

Transformer

Optimizer

—

Artifact Size

16MB

Training Techniques

Architecture

short convolution

Replaces static short convolution with token-adaptive convolution kernels formed as mixtures over shared basis kernels.

parameters: {"k":8}
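The mechanism described above can be sketched in NumPy as follows. This is a minimal illustration, not the PR's implementation: it assumes that "k" is the number of basis kernels, that the per-token mixture weights come from a softmax over a linear router, and that the convolution is causal and depthwise. The names `moc_short_conv`, `router_w`, `kernel_bank`, and the kernel width `W` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moc_short_conv(x, router_w, kernel_bank):
    """Causal depthwise short convolution with per-token mixed kernels.

    x:           (T, D) token activations
    router_w:    (D, K) linear router (assumed form of the routing)
    kernel_bank: (K, D, W) shared basis kernels, W taps per channel
    """
    T, D = x.shape
    K, _, W = kernel_bank.shape
    # Per-token mixture weights over the K basis kernels.
    mix = softmax(x @ router_w)                              # (T, K)
    # Each token's kernel is a convex combination of the basis kernels.
    kernels = np.einsum("tk,kdw->tdw", mix, kernel_bank)     # (T, D, W)
    # Left-pad so position t only sees positions t-W+1 .. t (causal).
    x_pad = np.concatenate([np.zeros((W - 1, D)), x], axis=0)
    windows = np.stack([x_pad[w:w + T] for w in range(W)], axis=1)  # (T, W, D)
    # Apply each token's own kernel to its own window.
    return np.einsum("twd,tdw->td", windows, kernels)        # (T, D)

rng = np.random.default_rng(0)
T, D, K, W = 6, 4, 8, 3
x = rng.normal(size=(T, D))
router_w = rng.normal(size=(D, K))
kernel_bank = rng.normal(size=(K, D, W))
y = moc_short_conv(x, router_w, kernel_bank)
print(y.shape)  # (6, 4)
```

Under these assumptions the added parameters are the router (D x K) and the bank (K x D x W), which stays small when K and W are small.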

short convolution

MoC reduces to standard short convolution when the kernel bank has a single basis kernel.

parameters: {"k":1}
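The reduction to a static kernel at k=1 can be checked numerically. The sketch below assumes the mixture weights are normalized with a softmax (the summary does not state the normalization); with a single basis kernel the softmax is over one logit, so every token's weight is exactly 1 and every token receives the same kernel. All names and shapes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, W = 5, 3, 4
x = rng.normal(size=(T, D))
kernel = rng.normal(size=(D, W))          # the single basis kernel
logits = x @ rng.normal(size=(D, 1))      # router logits, shape (T, 1)

# Softmax over a single entry: exp(0) / exp(0) = 1 for every token.
mix = np.exp(logits - logits.max(axis=-1, keepdims=True))
mix = mix / mix.sum(axis=-1, keepdims=True)               # (T, 1)

# Per-token kernels from a bank of size 1.
per_token = np.einsum("tk,kdw->tdw", mix, kernel[None])   # (T, D, W)

# Every token's kernel equals the static kernel.
static_everywhere = np.allclose(per_token, kernel)
print(static_everywhere)  # True
```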

MLP expansion

Adjusted MLP expansion to keep models within the parameter budget.

parameters: {"baseline":"2.00x","short_conv":"1.99x","moc":"1.93x"}

Mixture of Convolutions (MoC): token-adaptive short convolutions via kernel mixtures
Per-token routing over a small shared bank of basis kernels
Dynamic local operator that generalizes standard short convolution
Improved BPB over baseline and static short convolution in fixed-step experiments
Demonstration that direct per-token kernel projection performed poorly while mixture-based kernels were stable