val_bpb: 1.2008
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
multi-stream Transformer
Uses N parallel Transformer streams ('arteries') with lane-wise mixers for cross-stream fusion and independent LM heads per artery during training.
parameters: {"arteries":2,"layers":10,"artery_dim":384,"heads":3,"windows":[1024,4096],"ctx":16384}
XSA
Applies XSA only to the smaller-window artery (artery 0, window 1024), rather than to every artery.
parameters: {"artery":0,"window":1024}
RoPE
Uses RoPE on U-Net-style residual slots to encode slot position / residual-source identity.
parameters: null
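A sketch of standard rotary embeddings keyed by slot index rather than token position, which is one plausible reading of "RoPE on residual slots"; `rope` and the slot layout are illustrative.

```python
import torch

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding: rotate feature pairs of x by
    angles that depend on `pos`. Here `pos` is a residual-slot index, so
    slots are distinguished by their rotation rather than by content."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)  # (half,)
    ang = pos[..., None] * freqs                                 # (..., half)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Hypothetical use: S residual slots, each rotated by its slot id.
S, d = 4, 384
slots = torch.randn(S, d)
slots = rope(slots, torch.arange(S, dtype=slots.dtype))
```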
U-Net skip connections
Adds U-Net-style residual slots at the target block, fused through the mixer KV path rather than by direct addition.
parameters: null
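A hedged sketch of that fusion as I read it: skip activations become extra key/value entries for the mixer instead of being added into the residual stream. The shapes and the plain softmax attention are assumptions.

```python
import torch
import torch.nn.functional as F

def mix_with_slots(artery_feats, slot_feats):
    """artery_feats: (B, T, A, d) per-token features for each of A arteries
    slot_feats:   (B, T, S, d) stored activations from earlier layers.
    The arteries query both themselves and the residual slots."""
    q = artery_feats                                   # queries: the arteries
    kv = torch.cat([artery_feats, slot_feats], dim=2)  # keys/values include slots
    attn = torch.einsum('btad,btkd->btak', q, kv) / q.shape[-1] ** 0.5
    w = F.softmax(attn, dim=-1)
    return torch.einsum('btak,btkd->btad', w, kv)      # fused artery features
```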
sliding window attention
Uses sliding-window attention with different window sizes across arteries instead of full attention.
parameters: {"windows":[1024,4096]}
linear attention
Implements the artery mixer as multi-head linear attention over the artery dimension with ELU routing scores.
parameters: null
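A sketch of such a mixer, assuming elu(x)+1 feature maps in the style of Katharopoulos et al. (2020) and treating the artery axis as the attention "sequence"; `LinearArteryMixer` is an illustrative name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearArteryMixer(nn.Module):
    """Multi-head linear attention whose 'sequence' axis is the artery axis,
    with elu(x)+1 feature maps as the positive routing scores. Mixing
    happens independently at every token position."""
    def __init__(self, dim=384, heads=3):
        super().__init__()
        self.h, self.hd = heads, dim // heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                      # x: (B, T, A, dim)
        B, T, A, d = x.shape
        shape = (B, T, A, self.h, self.hd)
        q = F.elu(self.wq(x).view(shape)) + 1  # positive routing scores
        k = F.elu(self.wk(x).view(shape)) + 1
        v = self.wv(x).view(shape)
        # out_a = q_a · (Σ_b k_b ⊗ v_b) / (q_a · Σ_b k_b), summed over arteries b
        kv = torch.einsum('btahk,btahv->bthkv', k, v)
        z = 1.0 / (torch.einsum('btahk,bthk->btah', q, k.sum(2)) + 1e-6)
        out = torch.einsum('btahk,bthkv->btahv', q, kv) * z[..., None]
        return out.reshape(B, T, A, d)
```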
Evaluation
long context eval
parameters: {"context_length":16384}
Novel Contributions
- N parallel Transformer streams ('arteries') with lane-wise mixers for cross-artery fusion
- Independent LM heads per artery with probability-space averaging at validation
- Selective XSA placement on the smaller-window artery
- U-Net-style residual slots fused through the mixer KV path
- RoPE applied to residual slots to encode slot identity
- Causal artery mixing used to inspect incremental artery contributions
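For the last point, a hedged single-head reconstruction of a causal mixer that would support such a probe; this is my sketch, not the source's code.

```python
import torch
import torch.nn.functional as F

def causal_artery_mix(x, wq, wk, wv):
    """Causal (single-head) variant of the linear mixer above: artery a
    attends only to arteries 0..a, so the LM head of artery a reflects the
    incremental contribution of adding artery a on top of arteries 0..a-1.
    x: (B, T, A, d); wq, wk, wv: (d, d) projection weights."""
    q = F.elu(x @ wq) + 1
    k = F.elu(x @ wk) + 1
    v = x @ wv
    # Prefix sums over the artery axis implement the causal restriction.
    kv = torch.cumsum(torch.einsum('btak,btav->btakv', k, v), dim=2)
    z = 1.0 / (torch.einsum('btak,btak->bta', q, torch.cumsum(k, dim=2)) + 1e-6)
    return torch.einsum('btak,btakv->btav', q, kv) * z[..., None]
```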