PR #1245
[Non-Record] Hymba-8L: Hybrid SSM + Sliding Window Attention with 32K Context (1.1470 BPB)
Status: open
by mkenney2
val_bpb
1.1470
Architecture
Hybrid
Optimizer
—
Artifact Size
15.7 MB
Training Techniques
Architecture
Mamba
Hybrid architecture combining Mamba SSM with sliding window attention in parallel within each block.
parameters: null
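The parallel SSM + attention fusion described above can be sketched structurally. The branch internals and the fusion rule are assumptions here; the submission may weight or normalize the branches differently:

```python
# Hedged structural sketch of one hybrid block: an SSM branch and a
# sliding-window-attention branch run in parallel on the same input,
# and their outputs are averaged into the residual stream. The branches
# are stand-in callables, not the submission's actual modules.
def hybrid_block(x, ssm_branch, attn_branch):
    return x + 0.5 * (ssm_branch(x) + attn_branch(x))
```

With identity branches, `hybrid_block(1.0, lambda t: t, lambda t: t)` doubles the input, which makes the residual-plus-averaged-branches structure easy to check.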
GQA
Grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
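A minimal sketch of grouped-query attention with the PR's head counts (8 query heads sharing 4 KV heads, so 2 query heads per KV head). The shapes and the causal mask are illustrative assumptions, not the submission's exact implementation:

```python
import numpy as np

# Hedged GQA sketch: each group of query heads attends against one
# shared KV head, halving KV cache size relative to full multi-head.
def gqa(q, k, v):
    # q: (heads, T, d); k, v: (kv_heads, T, d)
    heads, T, d = q.shape
    group = heads // k.shape[0]            # query heads per KV head (2 here)
    out = np.empty_like(q)
    causal = np.triu(np.full((T, T), -np.inf), k=1)  # t attends only to <= t
    for h in range(heads):
        kv = h // group                    # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d) + causal
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

At position 0 the causal mask leaves only self-attention, so every query head simply copies the first value vector of its shared KV head.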
RoPE
Uses rotary positional embeddings in the attention branch.
parameters: null
SmearGate
Uses SmearGate in the embedding/architecture stack.
parameters: null
BigramHash
Uses BigramHash embeddings.
parameters: null
U-Net skip connections
Adds U-Net style skip connections.
parameters: null
LeakyReLU
Applies a squared LeakyReLU activation (negative slope 0.9) in the MLP.
parameters: {"slope":0.9}
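One reading of "LeakyReLU, squared" is a ReLU²-style activation built on LeakyReLU; this minimal sketch assumes the square is applied elementwise to the LeakyReLU output:

```python
import numpy as np

# Assumed activation: LeakyReLU with negative slope 0.9, then squared
# elementwise (a LeakyReLU^2, by analogy with the common ReLU^2).
def leaky_relu_sq(x, slope=0.9):
    y = np.where(x >= 0, x, slope * x)
    return y * y
```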
weight tying
Untied embeddings: the lm_head uses separate weights instead of being tied to the token embeddings.
parameters: null
KV head count
Uses 4 KV heads.
parameters: {"kv_heads":4}
depth recurrence
Hybrid recurrent SSM branch with selective scan and causal convolution.
parameters: {"state_dim":4}
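A minimal diagonal selective-scan recurrence with state dimension 4, matching this entry's state_dim. In a real Mamba branch the per-step coefficients come from learned, input-dependent projections (plus a causal convolution); here they are passed in directly as an assumption:

```python
import numpy as np

# Hedged selective-scan sketch: h_t = a_t * h_{t-1} + b_t * x_t (elementwise
# over an N-dim state), y_t = c_t . h_t. With state_dim N = 4 as in the PR.
def selective_scan(x, a, b, c):
    # x: (T,); a, b, c: (T, N)
    T, N = a.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + b[t] * x[t]   # input-dependent state update
        y[t] = c[t] @ h              # readout
    return y
```

With a = 0, b = c = 1 the state is overwritten each step and the output is just N·x, a quick sanity check on the recurrence.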
Quantization
int8
bits: 8
scope: all
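Symmetric per-tensor quantization is the simplest reading of "int8, scope: all"; the submission's actual granularity (per-tensor vs. per-channel) is not stated here, so this is only a sketch:

```python
import numpy as np

# Hedged int8 weight quantization sketch: symmetric, per-tensor scale.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step, which is what lets an 8-layer model fit in the 15.7 MB artifact.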
Weight Averaging
EMA
parameters: {"decay":0.997}
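The EMA update with decay 0.997 can be sketched over a dict of parameters (floats stand in for weight tensors):

```python
# Hedged EMA weight-averaging sketch: after each step, the averaged
# weights move a fraction (1 - decay) toward the current weights.
def ema_update(ema, current, decay=0.997):
    return {k: decay * ema[k] + (1 - decay) * current[k] for k in ema}
```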
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"epochs":25,"freeze_blocks":0,"learning_rate":0.002}
Sequence Length
sequence_length
train_length: 32768
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":7000,"shape":"cosine"}
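A hedged sketch of a cosine-shaped warmdown: hold the base LR, then cosine-decay to zero over the final 7000 steps. The total step count and the constant hold phase are assumptions; only warmdown_steps and the cosine shape come from the entry:

```python
import math

# Hedged warmdown schedule sketch: constant LR until the final
# warmdown_steps, then a cosine decay from base_lr to zero.
def lr_at(step, total_steps, base_lr, warmdown_steps=7000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps   # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```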
Regularization
weight decay
parameters: {"weight_decay":0.15}
Other
other
Sliding window attention with a 1024-token window.
parameters: {"window":1024}
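The 1024-token window corresponds to a causal band mask: position t attends to positions max(0, t-1023)..t. A minimal sketch of the boolean mask (True = may attend):

```python
import numpy as np

# Hedged sliding-window mask sketch with the PR's 1024-token window.
def sliding_window_mask(T, window=1024):
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    return (j <= i) & (j > i - window)
```

This band structure is what keeps attention cost linear in sequence length at the 32K training context, with the SSM branch carrying longer-range state.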
Novel Contributions
- Hybrid Mamba + sliding window attention architecture
- 32K training context under the same compute budget
- 8-layer Hymba variant with improved BPB
- Expanded sliding window attention to 1024 tokens
- Reduced Mamba state dimension to 4
- Untied embeddings for better speed and quality
- High weight decay with int8 quantization to fit under 16 MB
- Long warmdown schedule and extended score-first TTT