val_bpb: 1.2008
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
multi-stream Transformer
Uses N parallel Transformer streams ('arteries') with lane-wise mixers for cross-stream fusion and independent LM heads per artery during training.
parameters: {"arteries":2,"layers":10,"artery_dim":384,"heads":3,"windows":[1024,4096],"ctx":16384}
XSA
Applies XSA only to the smaller-window artery (artery 0, window 1024), rather than to every artery.
parameters: {"artery":0,"window":1024}
RoPE
Uses RoPE on U-Net-style residual slots to encode slot position / residual-source identity.
parameters: null
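A sketch of standard rotary embeddings keyed by slot index rather than token position, which is one plausible reading of "RoPE on residual slots"; `rope` and the slot layout are illustrative.

```python
import torch

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding: rotate feature pairs of x by
    angles that depend on `pos`. Here `pos` is a residual-slot index, so
    slots are distinguished by their rotation rather than by content."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)  # (half,)
    ang = pos[..., None] * freqs                                 # (..., half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Hypothetical use: S residual slots, each rotated by its slot id.
S, d = 4, 384
slots = torch.randn(S, d)
slots = rope(slots, torch.arange(S, dtype=slots.dtype))
```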
U-Net skip connections
Adds U-Net-style residual slots at the target block, fused through the mixer KV path rather than by direct addition.
parameters: null
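A hedged sketch of that fusion as I read it: skip activations become extra key/value entries for the mixer instead of being added into the residual stream. The shapes and the plain softmax attention are assumptions.

```python
import torch
import torch.nn.functional as F

def mix_with_slots(artery_feats, slot_feats):
    """artery_feats: (B, T, A, d) per-token features for each of A arteries
    slot_feats:   (B, T, S, d) stored activations from earlier layers.
    The arteries query both themselves and the residual slots."""
    q = artery_feats                                   # queries: the arteries
    kv = torch.cat([artery_feats, slot_feats], dim=2)  # keys/values include slots
    attn = torch.einsum('btad,btkd->btak', q, kv) / q.shape[-1] ** 0.5
    w = F.softmax(attn, dim=-1)
    return torch.einsum('btak,btkd->btad', w, kv)      # fused artery features
```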
sliding window attention
Uses sliding-window attention with different window sizes across arteries instead of full attention.
parameters: {"windows":[1024,4096]}
linear attention
Implements the artery mixer as multi-head linear attention over the artery dimension with ELU routing scores.
parameters: null
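A sketch of such a mixer, assuming elu(x)+1 feature maps in the style of Katharopoulos et al. (2020) and treating the artery axis as the attention "sequence"; `LinearArteryMixer` is an illustrative name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearArteryMixer(nn.Module):
    """Multi-head linear attention whose 'sequence' axis is the artery axis,
    with elu(x)+1 feature maps as the positive routing scores. Mixing
    happens independently at every token position."""
    def __init__(self, dim=384, heads=3):
        super().__init__()
        self.h, self.hd = heads, dim // heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                      # x: (B, T, A, dim)
        B, T, A, d = x.shape
        shape = (B, T, A, self.h, self.hd)
        q = F.elu(self.wq(x).view(shape)) + 1  # positive routing scores
        k = F.elu(self.wk(x).view(shape)) + 1
        v = self.wv(x).view(shape)
        # out_a = q_a · (Σ_b k_b ⊗ v_b) / (q_a · Σ_b k_b), summed over arteries b
        kv = torch.einsum('btahk,btahv->bthkv', k, v)
        z = 1.0 / (torch.einsum('btahk,bthk->btah', q, k.sum(2)) + 1e-6)
        out = torch.einsum('btahk,bthkv->btahv', q, kv) * z[..., None]
        return out.reshape(B, T, A, d)
```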
Evaluation
long context eval
parameters: {"context_length":16384}
Novel Contributions
- N parallel Transformer streams ('arteries') with lane-wise mixers for cross-artery fusion
- Independent LM heads per artery with probability-space averaging at validation
- Selective XSA placement on the smaller-window artery
- U-Net-style residual slots fused through the mixer KV path
- RoPE applied to residual slots to encode slot identity
- Causal artery mixing used to inspect incremental artery contributions
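For the last point, a hedged single-head reconstruction of a causal mixer that would support such a probe; this is my sketch, not the source's code.

```python
import torch
import torch.nn.functional as F

def causal_artery_mix(x, wq, wk, wv):
    """Causal (single-head) variant of the linear mixer above: artery a
    attends only to arteries 0..a, so the LM head of artery a reflects the
    incremental contribution of adding artery a on top of arteries 0..a-1.
    x: (B, T, A, d); wq, wk, wv: (d, d) projection weights."""
    q = F.elu(x @ wq) + 1
    k = F.elu(x @ wk) + 1
    v = x @ wv
    # Prefix sums over the artery axis implement the causal restriction.
    kv = torch.cumsum(torch.einsum('btak,btav->btakv', k, v), dim=2)
    z = 1.0 / (torch.einsum('btak,btak->bta', q, torch.cumsum(k, dim=2)) + 1e-6)
    return torch.einsum('btak,btakv->btav', q, kv) * z[..., None]
```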