val_bpb
1.3477
Architecture
Hybrid
Optimizer
Muon
Artifact Size
13.0 MB
Training Techniques
Architecture
BigramHash
Adds a bigram hash embedding path alongside token embeddings.
parameters: {"dimensions":160,"size":16384}
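The record gives only the table size and embedding width, so the following is a minimal sketch of how a bigram hash lookup could be indexed, not the submission's actual code. The hash constants, the `pad_token` convention, and the function names are illustrative assumptions.

```python
# Hypothetical sketch: map each (previous, current) token pair to one row of a
# hashed embedding table. Only NUM_BUCKETS and EMBED_DIM come from the record.
NUM_BUCKETS = 16384  # "size" in the parameters above
EMBED_DIM = 160      # "dimensions" in the parameters above


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a token pair into a bucket index (multiplicative hash, assumed)."""
    mixed = (prev_token * 1_000_003) ^ (token * 998_244_353)
    return mixed % num_buckets


def bigram_indices(tokens: list[int], pad_token: int = 0) -> list[int]:
    """Return one bucket index per position; position 0 pairs with pad_token."""
    prev = [pad_token] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```

Each index would select a 160-dimensional row that is combined with the ordinary token embedding. Distinct bigrams can collide in the same bucket, which is the usual trade-off of a hashed table.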
SmearGate
Uses a smear gate after the embedding and bigram hash inputs.
parameters: null
U-Net skip connections
Uses U-Net style skip connections across HybridAtlasBlocks.
parameters: null
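As a rough illustration of U-Net style skips over a stack of blocks, the sketch below saves the outputs of the first half of the stack and adds them back, in reverse order, to the inputs of the second half. The blocks are plain callables standing in for HybridAtlasBlocks, and the `skip_weight` scalar is an assumption.

```python
def run_with_unet_skips(x, blocks, skip_weight=1.0):
    """Run blocks with mirrored skip connections (first half -> second half)."""
    half = len(blocks) // 2
    saved = []
    # "Encoder" half: run each block and remember its output.
    for block in blocks[:half]:
        x = block(x)
        saved.append(x)
    # "Decoder" half: add the mirrored saved output before each block.
    for block in blocks[half:]:
        if saved:
            x = x + skip_weight * saved.pop()
        x = block(x)
    return x
```

With four blocks, the third block receives the second block's output as a skip and the fourth receives the first block's output.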
XSA
Uses causal self-attention in the temporal path.
parameters: {"heads":8,"layers_per_pass":2}
RoPE
Applies rotary positional embeddings in attention.
parameters: null
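For reference, a minimal sketch of rotary embeddings on a single vector follows. It uses the interleaved-pair convention and a base of 10000; both are assumptions, since the record does not say which variant is used.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Because each pair is rotated by an angle proportional to position, the dot product between a rotated query and a rotated key depends only on the difference of their positions.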
Gated Attention
Combines the SSM and attention outputs through a learned gate.
parameters: null
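One simple way to realize such a gate is a per-channel convex combination controlled by a sigmoid. The sketch below assumes that form; the record does not specify how the gate is parameterized, and `gate_logits` is a hypothetical name for the learned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_mix(ssm_out, attn_out, gate_logits):
    """Blend two paths per channel: g * ssm + (1 - g) * attn, with g = sigmoid(logit)."""
    mixed = []
    for s, a, logit in zip(ssm_out, attn_out, gate_logits):
        g = sigmoid(logit)
        mixed.append(g * s + (1.0 - g) * a)
    return mixed
```

A gate logit of zero averages the two paths; large positive or negative logits select the SSM or the attention output respectively.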
weight tying
Ties the output projection to the token embedding matrix.
parameters: null
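Weight tying means the same matrix serves as both the input lookup table and the output projection. A minimal sketch with nested lists, where the function names are illustrative:

```python
def embed(token, embedding):
    """Input side: look up the row for a token."""
    return embedding[token]


def tied_logits(hidden, embedding):
    """Output side: logits[v] = dot(hidden, embedding[v]), reusing the same matrix."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
```

Since no separate output matrix is stored, tying removes vocab_size x model_dim parameters from the artifact.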
Kernel readout heads
Uses five parallel kernel readout heads: Nyström, Gabor, Laplacian, Tucker GL, and Linear.
parameters: {"heads":5}
other
Geometric language model with Stäckel coordinate encoders and kernel-based spatial readout.
parameters: {"encoders":3,"encoder_dim":160}
other
Two-pass FFT SSM plus attention temporal path.
parameters: {"passes":2}
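To illustrate the FFT side of such a path, the sketch below evaluates a scalar linear recurrence h_t = decay * h_{t-1} + x_t as a causal convolution computed with an FFT. The scalar decay kernel, the hand-written radix-2 FFT, and all function names are assumptions for illustration; the record does not describe the actual SSM parameterization or what each of the two passes does.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def ssm_kernel(decay, length):
    """Impulse response of h_t = decay * h_{t-1} + x_t."""
    return [decay ** t for t in range(length)]


def causal_conv_fft(signal, kernel):
    """Causal convolution via zero-padded FFT; returns len(signal) outputs."""
    L = len(signal)
    n = 1
    while n < 2 * L:  # pad so circular wrap-around cannot reach the first L outputs
        n *= 2
    k = list(kernel[:L])
    fs = fft(list(signal) + [0.0] * (n - L))
    fk = fft(k + [0.0] * (n - len(k)))
    y = fft([p * q for p, q in zip(fs, fk)], invert=True)
    return [y[t].real / n for t in range(L)]
```

Calling `causal_conv_fft(x, ssm_kernel(0.5, len(x)))` gives the same outputs as stepping the recurrence, but in O(L log L) time over the whole sequence.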
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"decoupled_with":"AdamW for scalars"}
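The routing rule itself is not spelled out here. A common convention, assumed in this sketch, is to send 2-D weight matrices to Muon and everything else (scalars, gains, biases) to AdamW. The parameter names and shapes in the usage note are hypothetical.

```python
def route_parameters(named_shapes):
    """Split parameter names into a Muon group and an AdamW group by shape.

    Assumption: 2-D matrices go to Muon; scalars and 1-D tensors go to AdamW.
    """
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

For example, `route_parameters({"attn.qkv.weight": (480, 160), "gate.scale": (), "norm.weight": (160,)})` places only the first entry in the Muon group.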
Weight Averaging
SWA
parameters: {"start":"last 40%"}
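A minimal sketch of stochastic weight averaging over the final 40% of training is shown below. It assumes a uniform average over saved weight snapshots, each represented as a flat list; the snapshot cadence and the helper name are assumptions.

```python
def swa_average(checkpoints, last_percent=40):
    """Uniformly average the weight vectors from the last `last_percent` of checkpoints."""
    n_tail = max(1, (len(checkpoints) * last_percent) // 100)
    tail = checkpoints[-n_tail:]
    dim = len(tail[0])
    return [sum(weights[i] for weights in tail) / n_tail for i in range(dim)]
```

With ten snapshots, only the last four contribute to the averaged weights.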
Compression
zstd
level: 22
Regularization
logit softcap
parameters: {"value":30}
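Logit softcapping is usually implemented as a scaled tanh, which is the form assumed in this sketch. The cap of 30 matches the value above.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap] via cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate at plus or minus 30, which keeps the softmax from becoming arbitrarily peaked.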
weight decay
parameters: null
Other
other
Adds a Stäckel penalty that softly encourages diagonal covariance in the coordinate encoders.
parameters: {"beta":0.02}
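The exact form of the penalty is not given. One plausible reading, assumed here, is a term that penalizes the squared off-diagonal entries of the covariance of the encoder outputs, scaled by beta. The function name and the choice of a biased covariance estimate are illustrative.

```python
def offdiag_cov_penalty(samples, beta=0.02):
    """Return beta * sum of squared off-diagonal covariances of the sample columns."""
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    penalty = 0.0
    for j in range(d):
        for k in range(d):
            if j == k:
                continue  # diagonal (variance) terms are not penalized
            cov = sum((row[j] - means[j]) * (row[k] - means[k]) for row in samples) / n
            penalty += cov * cov
    return beta * penalty
```

The penalty is zero when the coordinates are uncorrelated and grows as they become correlated, so it nudges the covariance toward diagonal without enforcing it as a hard constraint.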
Novel Contributions
- Geometric language model using learned Stäckel coordinate encoders
- Five parallel kernel readout heads for spatial readout
- Two-pass FFT SSM plus attention temporal path
- Stiefel-enforced chart encoders
- Learned mixture between spatial and temporal paths
- Surgical Muon routing with AdamW for scalar parameters
- SWA and zstd compression to fit the submission artifact