PR #599

open

[Non-Record] Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828)

by mkenney2
val_bpb
1.1828
Architecture
Hybrid Attention + Mamba SSM
Optimizer
Muon (matrix), Adam (scalar/embed)
Artifact Size
~15.1 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
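A minimal sketch of what int6 post-training quantization can look like, assuming symmetric per-tensor rounding (the PR's actual scheme, e.g. per-channel scales or asymmetric ranges, may differ):

```python
import numpy as np

def quantize_int6(w, bits=6):
    # Symmetric quantization to signed 6-bit range [-32, 31].
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)                            # reconstruction error <= s / 2
```

With `scope: all`, every weight tensor would pass through such a step before the artifact is packed and compressed.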
Architecture
Hybrid Attention + Mamba SSM
A 7-layer hybrid model that runs attention and a Mamba SSM in parallel within each block, merging their outputs by a learned weighted average
parameters: {"layers":7,"attention_heads":8,"kv_heads":4,"ssm_state_size":8,"mlp_multiplier":4}
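The parallel-branch block can be sketched as below. This is an illustrative stand-in, not the PR's code: the attention is single-head and unmasked for brevity, and the SSM is a toy diagonal scan rather than a full Mamba block with selective parameters; the merge by learned (softmaxed) scalar weights is the point being shown:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head self-attention; causal masking omitted for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def ssm(x, A, B, C):
    # Toy diagonal state-space scan (stand-in for the Mamba branch).
    h = np.zeros(A.shape[0])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]         # recurrent state update
        out[t] = C @ h               # readout back to model dim
    return out

def hybrid_block(x, params, merge_logits):
    # Both branches see the same input; outputs are merged by a
    # learned weighted average (two trainable scalars, softmaxed).
    a = attention(x, *params["attn"])
    s = ssm(x, *params["ssm"])
    w = softmax(merge_logits)
    return w[0] * a + w[1] * s

T, d, state = 5, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
params = {
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "ssm": (np.full(state, 0.9),
            rng.normal(size=(state, d)) * 0.1,
            rng.normal(size=(d, state)) * 0.1),
}
y = hybrid_block(x, params, merge_logits=np.zeros(2))
```

Because both branches run on the same input, they can be computed concurrently and only synchronize at the merge.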
Optimizer
Muon (matrix), Adam (scalar/embed)
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02}
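Muon is applied to 2-D weight matrices only, with Adam handling scalars and embeddings. The core of the Muon update is an approximate orthogonalization of the gradient via a Newton-Schulz iteration; the quintic coefficients below follow the public Muon reference implementation, and this is a sketch of that step alone, not the full optimizer:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Approximately orthogonalize a gradient matrix (the core of Muon).
    # Quintic coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # keep X @ X.T the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = np.random.default_rng(0).standard_normal((16, 32))
O = newton_schulz_orthogonalize(G)       # singular values pushed toward 1
```

The orthogonalized gradient is then applied with the matrix learning rate (0.02 here); non-matrix parameters bypass this step entirely.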
Weight Averaging
EMA
parameters: {"decay":0.997}
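EMA weight averaging maintains a shadow copy of the weights updated once per step; with decay 0.997 as in this PR's config, the average effectively spans the last few hundred steps. A minimal sketch:

```python
def ema_update(avg, new, decay=0.997):
    # Shadow weights move a fraction (1 - decay) toward current weights each step.
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]

avg = [1.0, 0.0]
for step in range(3):
    avg = ema_update(avg, [0.0, 1.0])    # training weights held fixed here
```

The EMA weights, not the raw training weights, are typically what gets quantized and evaluated.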
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
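Sliding-window evaluation scores a long token stream with a fixed context window, advancing by the stride (64 here) and counting loss only on the new tokens, so each scored token keeps near-full left context. A sketch with a dummy model (`nll_fn` stands in for a per-token negative log-likelihood; the exact windowing in the PR may differ):

```python
import math

def sliding_window_nll(tokens, window, stride, nll_fn):
    # Score tokens[begin:end] in stride-sized chunks, conditioning each
    # token on up to `window` tokens of left context.
    total, count = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + stride, len(tokens))
        ctx_start = max(0, end - window)
        for i in range(begin, end):
            total += nll_fn(tokens[ctx_start:i], tokens[i])
            count += 1
    return total / count                  # mean nats per token

# Dummy model: uniform over a 256-symbol vocab -> exactly 8 bits per token.
nll = sliding_window_nll(list(range(200)), window=128, stride=64,
                         nll_fn=lambda ctx, tok: math.log(256))
bpb = nll / math.log(2)                   # bits per byte at 1 byte per token
```

A smaller stride raises eval cost (more forward passes) but gives every token longer context, usually lowering measured bpb.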
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"shape":"cosine"}
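A warmdown schedule holds the learning rate constant and then decays it to zero over the final steps; with `shape: cosine` and 3,000 warmdown steps as configured here, a sketch (exact boundary handling in the PR may differ):

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    # Constant LR until the warmdown window, then cosine decay to zero.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```

Per the contributions list, this aggressive end-of-run decay (with a lower base LR) is credited with shrinking the quantization gap without quantization-aware training.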

Novel Contributions

  • First competitive non-transformer architecture in the competition
  • Hybrid model combining standard GQA attention and Mamba SSM in parallel within each block
  • Learned weighted average merging of attention and Mamba branch outputs
  • Fused input projection for K, V, and Mamba for GPU efficiency
  • Shallow models (7 layers) outperform deeper transformer baselines at a given compute budget
  • Training stability improvements (lower LR and aggressive cosine warmdown) reduce the quantization gap without quantization-aware training (QAT)
  • Minimal overhead from the Mamba branch in multi-GPU training
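The fused input projection mentioned above can be sketched as a single GEMM whose output is split into K, V, and the Mamba-branch input; one large matmul keeps the GPU busier than three small ones. Dimensions here are illustrative, not the PR's actual sizes:

```python
import numpy as np

def fused_input_projection(x, W_fused, d_k, d_v, d_ssm):
    # One matmul produces all three projections, then slices them apart.
    fused = x @ W_fused                        # (T, d_k + d_v + d_ssm)
    k = fused[:, :d_k]
    v = fused[:, d_k:d_k + d_v]
    m = fused[:, d_k + d_v:]
    return k, v, m

rng = np.random.default_rng(0)
T, d_model, d_k, d_v, d_ssm = 8, 32, 16, 16, 64
W = rng.normal(size=(d_model, d_k + d_v + d_ssm))
x = rng.normal(size=(T, d_model))
k, v, m = fused_input_projection(x, W, d_k, d_v, d_ssm)
```

Slicing a fused output is mathematically identical to running the three projections separately, so the fusion changes performance but not results.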