PR #1239
Status: open · [Non-Record]
Whirlpool v5b — Non-Euclidean Lorentzian Attention on the Hyperboloid Manifold
by tmancino
val_bpb
1.5918
Architecture
Transformer
Optimizer
MuonAdamW
Artifact Size
12.2 MB
Training Techniques
Architecture
GQA
Grouped query attention with 12 heads and 6:1 KV grouping.
parameters: {"heads":12,"kv_grouping":"6:1","head_dim":64}
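The 6:1 grouping means the 12 query heads share only 2 KV heads. A minimal sketch of that head mapping (names and structure are illustrative, not taken from the PR):

```python
# Hypothetical sketch of the GQA head mapping described above:
# 12 query heads with 6:1 KV grouping -> 2 shared KV heads.
N_HEADS = 12
KV_GROUP = 6                       # query heads per shared KV head
N_KV_HEADS = N_HEADS // KV_GROUP   # -> 2

def kv_head_for(query_head: int) -> int:
    """Map a query head index to the KV head it reads from."""
    return query_head // KV_GROUP

# heads 0-5 read KV head 0, heads 6-11 read KV head 1
mapping = [kv_head_for(h) for h in range(N_HEADS)]
```

This cuts the KV cache to 1/6 of the multi-head size while keeping all 12 query projections.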
depth recurrence
Three shared blocks are reused across 8 orbits, creating an effective depth of 24 block applications through weight sharing.
parameters: {"blocks":3,"orbits":8}
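The reuse pattern can be sketched as a simple nested loop; the toy blocks below stand in for transformer blocks:

```python
# Sketch of depth recurrence: 3 shared blocks reused over 8 orbits,
# i.e. 24 block applications from only 3 sets of weights.
def run_recurrent(x, blocks, orbits=8):
    for _ in range(orbits):
        for block in blocks:   # same weights revisited every orbit
            x = block(x)
    return x

# toy stand-ins for real blocks (each just adds a constant)
blocks = [lambda x, i=i: x + i for i in range(3)]
effective_depth = 8 * len(blocks)  # 24 applications
```

Parameter count stays at 3 blocks' worth while compute scales with the full 24-deep unrolled stack.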
attention modification
Lorentzian attention on the hyperboloid manifold using Minkowski inner products instead of dot-product attention.
parameters: {"curvature_range":[0.1,2]}
LeakyReLU
The MLP uses a fused LeakyReLU(0.5)^2 activation (leaky rectification followed by squaring).
parameters: {"slope":0.5}
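A minimal unfused reference for that activation, assuming "LeakyReLU(0.5)^2" means squaring the leaky-rectified value (the PR's fused kernel is not shown):

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Square of LeakyReLU: y = max(x, slope*x) for slope < 1, then y*y."""
    y = x if x >= 0 else slope * x
    return y * y

# leaky_relu_sq(2.0)  -> 4.0
# leaky_relu_sq(-2.0) -> (0.5 * -2.0)**2 = 1.0
```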
Optimizer
MuonAdamW
weight_decay: 0.12
momentum: 0.85
other_params: {"lr":0.04}
Weight Averaging
EMA
parameters: {"decay":0.997}
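EMA weight averaging with decay 0.997 follows the standard update; a sketch over a flat parameter dict (the real implementation would operate on tensors):

```python
def ema_update(avg: dict, new: dict, decay: float = 0.997) -> dict:
    """One EMA step: avg <- decay * avg + (1 - decay) * new, per parameter."""
    return {k: decay * avg[k] + (1.0 - decay) * new[k] for k in avg}

# after each optimizer step, the shadow weights drift 0.3% toward the live weights
shadow = ema_update({"w": 1.0}, {"w": 0.0})
```

At eval time the shadow (averaged) weights are used in place of the live ones.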
Quantization
int8
bits: 8
scope: model weights
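The 12.2 MB artifact size follows from storing weights as int8. A sketch of symmetric per-tensor int8 quantization, a common scheme; whether the PR uses per-tensor or per-channel scales is not stated:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: scale = max|w| / 127, then round and clamp."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 codes and the scale."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize_int8(q, scale)
```

Each weight's round-trip error is bounded by half the scale, so the largest-magnitude tensors dominate quantization noise.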
Compression
zlib
level: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0005,"steps":1}
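Combined with the "best-result selection" noted under Novel Contributions, the TTT step appears to adapt for one gradient step at lr 5e-4 and keep whichever weights score better. A hedged sketch on a scalar toy objective (the actual objective and selection criterion are not specified in the PR):

```python
def ttt_step(w, grad_fn, lr=5e-4, steps=1):
    """Plain gradient descent for the configured number of test-time steps."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def ttt_with_best(w, grad_fn, score_fn):
    """Adapt, then keep original or adapted weights, whichever scores lower."""
    adapted = ttt_step(w, grad_fn)
    return min((w, adapted), key=score_fn)

# toy objective f(w) = (w - 1)^2, minimized at w = 1
score = lambda w: (w - 1.0) ** 2
grad = lambda w: 2.0 * (w - 1.0)
best = ttt_with_best(0.0, grad, score)
```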
LR Schedule
linear warmup
parameters: {"warmup_fraction":0.2}
cosine decay
parameters: null
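The two schedule pieces compose in the usual way: linear ramp over the first 20% of steps, cosine decay over the rest. A sketch using the peak lr 0.04 from the optimizer block; the decay floor of 0 is an assumption:

```python
import math

def lr_at(step, total_steps, peak_lr=0.04, warmup_fraction=0.2):
    """Linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# ramps up through step 199 of 1000, peaks at 0.04, then decays toward 0
```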
Regularization
weight decay
parameters: {"value":0.12}
Novel Contributions
- Lorentzian attention using Minkowski inner products on a hyperboloid manifold
- Custom Flash Lorentz Attention Triton kernel with fused projection, inner product, and centroid aggregation
- Progressive curvature orbits with different curvature values across shared blocks
- Scale clamping and extended warmup to stabilize non-Euclidean training
- Parallel GPU test-time training evaluation with best-result selection