PR #1863

closed

Innovation Architecture: Arterial Mixer (N-Lane Transformer)

by seinare
val_bpb: 1.2008
Architecture: Transformer
Optimizer
Artifact Size

Training Techniques

Architecture: multi-stream Transformer
Uses N parallel Transformer streams ('arteries') with lane-wise mixers for cross-stream fusion and independent LM heads per artery during training.
parameters: {"arteries":2,"layers":10,"artery_dim":384,"heads":3,"windows":[1024,4096],"ctx":16384}
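The stream-and-mixer layout can be sketched minimally in shapes only. This is an illustrative stand-in, not the PR's code: `artery_block` and `lane_mixer` are hypothetical placeholders for the per-artery Transformer block and the cross-artery fusion step.

```python
import numpy as np

# Hypothetical shape sketch of the N-lane layout (names are illustrative).
N_ARTERIES, LAYERS, D = 2, 10, 384
B, T = 1, 8  # tiny batch/sequence for illustration

rng = np.random.default_rng(0)
# One hidden stream per artery: shape [N, B, T, D]
x = rng.standard_normal((N_ARTERIES, B, T, D))

def artery_block(h):
    """Stand-in for a per-artery Transformer block (identity here).
    Real code: attention with that artery's window size, plus MLP."""
    return h

def lane_mixer(h):
    """Lane-wise fusion: mix information across the artery axis at each (b, t).
    A simple mean broadcast here; the PR uses multi-head linear attention."""
    fused = h.mean(axis=0, keepdims=True)
    return h + fused

for _ in range(LAYERS):
    x = np.stack([artery_block(x[i]) for i in range(N_ARTERIES)])
    x = lane_mixer(x)

print(x.shape)  # (2, 1, 8, 384)
```

The key property shown: each artery keeps its own [B, T, D] stream end to end, and cross-artery communication happens only at the mixer step between blocks.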
XSA
Applies XSA selectively to the smaller-window artery to improve performance.
parameters: {"artery":0,"window":1024}
RoPE
Uses RoPE on U-Net-style residual slots to encode slot position / residual-source identity.
parameters: null
U-Net skip connections
Adds U-Net-style residual slots at the target block, fused through the mixer KV path rather than direct addition.
parameters: null
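One way to read the two entries above together: earlier-layer activations are stored as residual "slots", each slot is tagged with RoPE using its slot index as the position, and the tagged vectors are appended to the mixer's keys/values rather than summed into the stream. The sketch below assumes a standard RoPE rotation; the slot contents and dimensions are made up for illustration.

```python
import numpy as np

D = 8  # toy feature dimension

def rope(v, pos):
    """Standard RoPE: rotate feature pairs by angles scaled by `pos`.
    Here `pos` is the slot index, so rotation encodes slot identity."""
    half = D // 2
    theta = pos / (10000 ** (np.arange(half) / half))
    c, s = np.cos(theta), np.sin(theta)
    v1, v2 = v[:half], v[half:]
    return np.concatenate([v1 * c - v2 * s, v1 * s + v2 * c])

# Residual slots: activations saved from earlier layers (U-Net style).
slots = [np.ones(D), np.full(D, 2.0)]
# RoPE tags each slot with its index so the mixer can tell sources apart.
tagged = [rope(v, i) for i, v in enumerate(slots)]
# The tagged vectors join the mixer's key/value set instead of being
# added directly into the residual stream.
kv = np.stack(tagged)
print(kv.shape)  # (2, 8)
```

Because slot 0 gets a zero rotation, it passes through unchanged; later slots are rotated by increasing angles, giving the mixer a positional signature per residual source.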
sliding window attention
Uses sliding-window attention with different window sizes across arteries instead of full attention.
parameters: {"windows":[1024,4096]}
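A causal sliding-window mask with a different window per artery can be built as below; the window sizes here are tiny stand-ins for the PR's 1024/4096.

```python
import numpy as np

def sliding_window_mask(T, window):
    """Causal mask where token i attends to positions [i - window + 1, i]."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

T = 6
mask_small = sliding_window_mask(T, 2)  # artery 0: short window
mask_large = sliding_window_mask(T, 4)  # artery 1: longer window
print(mask_small.sum(), mask_large.sum())  # 11 18
```

Each artery sees the same tokens but with a different attention horizon, so the short-window artery specializes in local structure while the long-window artery carries longer-range context.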
linear attention
Implements the artery mixer as multi-head linear attention over the artery dimension with ELU routing scores.
parameters: null
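A minimal sketch of linear attention over the artery axis, assuming the ELU(x)+1 feature map commonly used for linear-attention routing scores (the PR's exact parameterization may differ). The "sequence" dimension here is the N arteries at one (batch, token) position, and a single head is shown.

```python
import numpy as np

def elu_plus_one(x):
    """ELU(x) + 1: a positive feature map, usable as a routing score."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Linear attention: phi(q) @ (phi(k)^T v) / (phi(q) @ sum_i phi(k_i))."""
    qf, kf = elu_plus_one(q), elu_plus_one(k)
    kv = kf.T @ v                # [d, d_v], aggregated key-value memory
    z = qf @ kf.sum(axis=0)      # [N], normalizer per query
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(1)
N, d = 2, 4                      # N arteries mixed at one (b, t) position
q = rng.standard_normal((N, d))
k = rng.standard_normal((N, d))
v = rng.standard_normal((N, d))
out = linear_attention(q, k, v)
print(out.shape)  # (2, 4)
```

Because the feature map is strictly positive, the normalizer is never zero, and the mixer's cost is linear in the number of arteries.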
Evaluation
long-context eval
parameters: {"context_length":16384}

Novel Contributions

  • N parallel Transformer streams ('arteries') with lane-wise mixers for cross-artery fusion
  • Independent LM heads per artery with probability-space averaging at validation
  • Selective XSA placement on the smaller-window artery
  • U-Net-style residual slots fused through the mixer KV path
  • RoPE applied to residual slots to encode slot identity
  • Causal artery mixing used to inspect incremental artery contributions
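The probability-space averaging in the second bullet can be illustrated with two hypothetical per-artery logit vectors: each head's logits are softmaxed first, and the resulting distributions are averaged, rather than averaging logits before the softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Logits from each artery's independent LM head (made-up values).
logits_a = np.array([2.0, 0.0, -1.0])
logits_b = np.array([1.0, 1.0, -2.0])

# Probability-space averaging: average the softmaxes, not the logits.
p = 0.5 * (softmax(logits_a) + softmax(logits_b))
print(round(p.sum(), 6))  # 1.0, still a valid distribution
```

Averaging in probability space keeps the result a proper distribution and behaves like an equal-weight ensemble of the per-artery heads, whereas logit averaging would implicitly take a geometric mean of the distributions.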