PR #1239
Status: open · [Non-Record]
Whirlpool v5b — Non-Euclidean Lorentzian Attention on the Hyperboloid Manifold
by tmancino
val_bpb
1.5918
Architecture
Transformer
Optimizer
MuonAdamW
Artifact Size
12.2 MB
Training Techniques
Architecture
GQA
Grouped query attention with 12 heads and 6:1 KV grouping.
parameters: {"heads":12,"kv_grouping":"6:1","head_dim":64}
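The 6:1 grouping means the 12 query heads share only 2 KV heads. A minimal sketch of that head mapping (names and structure are illustrative, not taken from the PR):

```python
# Hypothetical sketch of the GQA head mapping described above:
# 12 query heads with 6:1 KV grouping -> 2 shared KV heads.
N_HEADS = 12
KV_GROUP = 6                       # query heads per shared KV head
N_KV_HEADS = N_HEADS // KV_GROUP   # -> 2

def kv_head_for(query_head: int) -> int:
    """Map a query head index to the KV head it reads from."""
    return query_head // KV_GROUP

# heads 0-5 read KV head 0, heads 6-11 read KV head 1
mapping = [kv_head_for(h) for h in range(N_HEADS)]
```

This cuts the KV cache to 1/6 of the multi-head size while keeping all 12 query projections.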
depth recurrence
Three shared blocks are reused across 8 orbits, creating an effective depth of 24 block applications through weight sharing.
parameters: {"blocks":3,"orbits":8}
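The reuse pattern can be sketched as a simple nested loop; the toy blocks below stand in for transformer blocks:

```python
# Sketch of depth recurrence: 3 shared blocks reused over 8 orbits,
# i.e. 24 block applications from only 3 sets of weights.
def run_recurrent(x, blocks, orbits=8):
    for _ in range(orbits):
        for block in blocks:   # same weights revisited every orbit
            x = block(x)
    return x

# toy stand-ins for real blocks (each just adds a constant)
blocks = [lambda x, i=i: x + i for i in range(3)]
effective_depth = 8 * len(blocks)  # 24 applications
```

Parameter count stays at 3 blocks' worth while compute scales with the full 24-deep unrolled stack.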
attention modification
Lorentzian attention on the hyperboloid manifold using Minkowski inner products instead of dot-product attention.
parameters: {"curvature_range":[0.1,2]}
LeakyReLU
The MLP uses a fused LeakyReLU(0.5)^2 activation (leaky rectification followed by squaring).
parameters: {"slope":0.5}
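A minimal unfused reference for that activation, assuming "LeakyReLU(0.5)^2" means squaring the leaky-rectified value (the PR's fused kernel is not shown):

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Square of LeakyReLU: y = max(x, slope*x) for slope < 1, then y*y."""
    y = x if x >= 0 else slope * x
    return y * y

# leaky_relu_sq(2.0)  -> 4.0
# leaky_relu_sq(-2.0) -> (0.5 * -2.0)**2 = 1.0
```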
Optimizer
MuonAdamW
weight_decay: 0.12
momentum: 0.85
other_params: {"lr":0.04}
Weight Averaging
EMA
parameters: {"decay":0.997}
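EMA weight averaging with decay 0.997 follows the standard update; a sketch over a flat parameter dict (the real implementation would operate on tensors):

```python
def ema_update(avg: dict, new: dict, decay: float = 0.997) -> dict:
    """One EMA step: avg <- decay * avg + (1 - decay) * new, per parameter."""
    return {k: decay * avg[k] + (1.0 - decay) * new[k] for k in avg}

# after each optimizer step, the shadow weights drift 0.3% toward the live weights
shadow = ema_update({"w": 1.0}, {"w": 0.0})
```

At eval time the shadow (averaged) weights are used in place of the live ones.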
Quantization
int8
bits: 8
scope: model weights
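The 12.2 MB artifact size follows from storing weights as int8. A sketch of symmetric per-tensor int8 quantization, a common scheme; whether the PR uses per-tensor or per-channel scales is not stated:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: scale = max|w| / 127, then round and clamp."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 codes and the scale."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize_int8(q, scale)
```

Each weight's round-trip error is bounded by half the scale, so the largest-magnitude tensors dominate quantization noise.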
Compression
zlib
level: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"steps":1}
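Combined with the "best-result selection" noted under Novel Contributions, the TTT step appears to adapt for one gradient step at lr 5e-4 and keep whichever weights score better. A hedged sketch on a scalar toy objective (the actual objective and selection criterion are not specified in the PR):

```python
def ttt_step(w, grad_fn, lr=5e-4, steps=1):
    """Plain gradient descent for the configured number of test-time steps."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def ttt_with_best(w, grad_fn, score_fn):
    """Adapt, then keep original or adapted weights, whichever scores lower."""
    adapted = ttt_step(w, grad_fn)
    return min((w, adapted), key=score_fn)

# toy objective f(w) = (w - 1)^2, minimized at w = 1
score = lambda w: (w - 1.0) ** 2
grad = lambda w: 2.0 * (w - 1.0)
best = ttt_with_best(0.0, grad, score)
```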
LR Schedule
linear warmup
parameters: {"warmup_fraction":0.2}
cosine decay
parameters: null
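The two schedule pieces compose in the usual way: linear ramp over the first 20% of steps, cosine decay over the rest. A sketch using the peak lr 0.04 from the optimizer block; the decay floor of 0 is an assumption:

```python
import math

def lr_at(step, total_steps, peak_lr=0.04, warmup_fraction=0.2):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# ramps up through step 199 of 1000, peaks at 0.04, then decays toward 0
```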
Regularization
weight decay
parameters: {"value":0.12}
Novel Contributions
- Lorentzian attention using Minkowski inner products on a hyperboloid manifold
- Custom Flash Lorentz Attention Triton kernel with fused projection, inner product, and centroid aggregation
- Progressive curvature orbits with different curvature values across shared blocks
- Scale clamping and extended warmup to stabilize non-Euclidean training
- Parallel GPU test-time training evaluation with best-result selection