PR #599 (open, Non-Record)
Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828)
by mkenney2
val_bpb: 1.1828
Architecture: Hybrid Attention + Mamba SSM
Optimizer: Muon (matrix), Adam (scalar/embed)
Artifact Size: ~15.1 MB
Training Techniques
Quantization: int6 (bits: 6, scope: all)
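The PR reports int6 quantization applied to all weights but does not describe the scheme, so the following is a generic symmetric per-tensor sketch, not the author's actual code. A 6-bit signed integer spans [-32, 31]; symmetric schemes typically use the range [-31, 31] so that positive and negative values share one scale.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization (sketch): map floats to [-31, 31]."""
    max_abs = np.abs(w).max()
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # stored in int8 containers
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int6 codes and scale."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

With rounding to the nearest code, the per-weight reconstruction error is bounded by half a quantization step (scale / 2).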
Architecture: Hybrid Attention + Mamba SSM
  A 7-layer hybrid model that runs attention and a Mamba SSM in parallel within each block, merged by a learned weighted average.
  parameters: {"layers":7,"attention_heads":8,"kv_heads":4,"ssm_state_size":8,"mlp_multiplier":4}
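The PR does not show how the two parallel branches are merged beyond calling it a learned weighted average. A minimal sketch, assuming a learnable scalar gate squashed through a sigmoid so the mixing weight stays in (0, 1) (the actual parameterization, e.g. per-channel weights, may differ):

```python
import numpy as np

def merge_branches(attn_out: np.ndarray, ssm_out: np.ndarray, alpha: float) -> np.ndarray:
    """Learned weighted average of the two parallel branch outputs.

    alpha is a hypothetical learnable parameter; sigmoid(alpha) gives the
    attention branch's mixing weight, the SSM branch gets the remainder.
    """
    w = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid
    return w * attn_out + (1.0 - w) * ssm_out

attn = np.ones((2, 4))   # stand-in for the attention branch output
ssm = np.zeros((2, 4))   # stand-in for the Mamba branch output
out = merge_branches(attn, ssm, alpha=0.0)  # sigmoid(0) = 0.5, an even mix
```

Initializing alpha at 0 starts training with an even mix, letting gradient descent shift weight toward whichever branch helps the loss.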
Optimizer: Muon (matrix), Adam (scalar/embed)
  parameters: {"matrix_lr":0.02,"scalar_lr":0.02}; weight_decay and momentum not reported
Weight Averaging: EMA (decay: 0.997)
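EMA weight averaging maintains a shadow copy of the parameters updated as `ema = decay * ema + (1 - decay) * weights` after each step; evaluation then uses the shadow copy. A minimal sketch with the reported decay of 0.997 (the update form is standard, not taken from the PR):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: shadow weights drift toward the live weights at rate (1 - decay)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy run: live weight fixed at 1.0, shadow starts at 0.0.
ema = [0.0]
for _ in range(100):
    ema = ema_update(ema, [1.0], decay=0.997)
# After n steps the shadow reaches 1 - 0.997**n of the way to the target.
```

With decay 0.997 the effective averaging window is roughly 1 / (1 - 0.997) ≈ 333 steps, which smooths late-training noise in the weights.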
Compression: zstd (level: 22)
Evaluation: sliding window eval (stride: 64)
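Sliding window evaluation rescans the text in overlapping windows so that, except at the very start, every token is scored with a long left context; each window after the first contributes only its newest tokens to the loss. The PR gives only stride 64, so the window size of 2048 below is assumed from the training length:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (start, end, n_new) spans; only the last n_new tokens of each
    window are scored, so each token is evaluated exactly once."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, end))  # first window: score every token
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - window)  # keep the window at full width
        spans.append((start, new_end, new_end - end))
        end = new_end
    return spans

spans = sliding_windows(2200, window=2048, stride=64)
```

A small stride like 64 maximizes context per token at the cost of many forward passes (one per 64 evaluated tokens), which is why it is an eval-only technique.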
Sequence Length: train_length: 2048; eval_length not reported
LR Schedule: warmdown (warmdown_steps: 3000, shape: cosine)
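A warmdown schedule holds the learning rate constant and then decays it over the final steps; here the decay shape is cosine over the last 3000 steps. A sketch, assuming a total step count and using the reported matrix_lr of 0.02 as the base rate (both are illustrative, not confirmed by the PR):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 0.02, warmdown_steps: int = 3000) -> float:
    """Constant LR, then cosine decay to zero over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    t = (step - decay_start) / warmdown_steps  # progress through the warmdown, 0 -> 1
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

At the midpoint of the warmdown the cosine factor is 0.5, so the LR is half the base rate; it reaches (numerically) zero at the final step.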
Novel Contributions
- First competitive non-transformer architecture in the competition
- Hybrid model combining standard GQA attention with a Mamba SSM, run in parallel within each block
- Learned weighted average merging of the attention and Mamba branch outputs
- Fused input projection for K, V, and the Mamba branch for GPU efficiency
- Shallow models (7 layers) outperform deeper transformer baselines at a given compute budget
- Training stability improvements (lower LR and an aggressive cosine warmdown) reduce the quantization gap without QAT
- Minimal overhead from the Mamba branch in multi-GPU training
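The fused input projection contribution above can be sketched as follows: instead of three separate matmuls for K, V, and the Mamba branch input, one wider weight matrix produces all three in a single GEMM, and the result is split. The dimensions below are hypothetical; the PR does not state them.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
d_model, d_k, d_v, d_ssm = 64, 32, 32, 16
rng = np.random.default_rng(0)

# One fused weight matrix covering all three projections.
W_fused = rng.standard_normal((d_model, d_k + d_v + d_ssm))

x = rng.standard_normal((8, d_model))  # (tokens, d_model)
fused = x @ W_fused                    # a single GEMM replaces three launches
k, v, ssm_in = np.split(fused, [d_k, d_k + d_v], axis=-1)
```

The split is a view-level operation, so the fusion changes only kernel-launch count and memory traffic, not the math: each slice equals the matmul against the corresponding column block of `W_fused`.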