PR #599

open

[Non-Record] Hymba: Hybrid Attention + Mamba SSM (val_bpb 1.1828)

by mkenney2
val_bpb
1.1828
Architecture
Hybrid Attention + Mamba SSM
Optimizer
Muon (matrix), Adam (scalar/embed)
Artifact Size
~15.1 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
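A minimal sketch of what int6 post-training quantization can look like, assuming symmetric per-tensor rounding (the PR's actual scheme, e.g. per-channel scales or asymmetric ranges, may differ):

```python
import numpy as np

def quantize_int6(w, bits=6):
    # Symmetric quantization to signed 6-bit range [-32, 31].
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)                            # reconstruction error <= s / 2
```

With `scope: all`, every weight tensor would pass through such a step before the artifact is packed and compressed.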
Architecture
Hybrid Attention + Mamba SSM
A 7-layer hybrid model that runs attention and a Mamba SSM in parallel within each block, merging their outputs by a learned weighted average
parameters: {"layers":7,"attention_heads":8,"kv_heads":4,"ssm_state_size":8,"mlp_multiplier":4}
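The parallel-branch block can be sketched as below. This is an illustrative stand-in, not the PR's code: the attention is single-head and unmasked for brevity, and the SSM is a toy diagonal scan rather than a full Mamba block with selective parameters; the merge by learned (softmaxed) scalar weights is the point being shown:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head self-attention; causal masking omitted for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def ssm(x, A, B, C):
    # Toy diagonal state-space scan (stand-in for the Mamba branch).
    h = np.zeros(A.shape[0])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]         # recurrent state update
        out[t] = C @ h               # readout back to model dim
    return out

def hybrid_block(x, params, merge_logits):
    # Both branches see the same input; outputs are merged by a
    # learned weighted average (two trainable scalars, softmaxed).
    a = attention(x, *params["attn"])
    s = ssm(x, *params["ssm"])
    w = softmax(merge_logits)
    return w[0] * a + w[1] * s

T, d, state = 5, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
params = {
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "ssm": (np.full(state, 0.9),
            rng.normal(size=(state, d)) * 0.1,
            rng.normal(size=(d, state)) * 0.1),
}
y = hybrid_block(x, params, merge_logits=np.zeros(2))
```

Because both branches run on the same input, they can be computed concurrently and only synchronize at the merge.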
Optimizer
Muon (matrix), Adam (scalar/embed)
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.02,"scalar_lr":0.02}
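Muon is applied to 2-D weight matrices only, with Adam handling scalars and embeddings. The core of the Muon update is an approximate orthogonalization of the gradient via a Newton-Schulz iteration; the quintic coefficients below follow the public Muon reference implementation, and this is a sketch of that step alone, not the full optimizer:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Approximately orthogonalize a gradient matrix (the core of Muon).
    # Quintic coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # keep X @ X.T the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = np.random.default_rng(0).standard_normal((16, 32))
O = newton_schulz_orthogonalize(G)       # singular values pushed toward 1
```

The orthogonalized gradient is then applied with the matrix learning rate (0.02 here); non-matrix parameters bypass this step entirely.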
Weight Averaging
EMA
parameters: {"decay":0.997}
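EMA weight averaging maintains a shadow copy of the weights updated once per step; with decay 0.997 as in this PR's config, the average effectively spans the last few hundred steps. A minimal sketch:

```python
def ema_update(avg, new, decay=0.997):
    # Shadow weights move a fraction (1 - decay) toward current weights each step.
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]

avg = [1.0, 0.0]
for step in range(3):
    avg = ema_update(avg, [0.0, 1.0])    # training weights held fixed here
```

The EMA weights, not the raw training weights, are typically what gets quantized and evaluated.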
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
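Sliding-window evaluation scores a long token stream with a fixed context window, advancing by the stride (64 here) and counting loss only on the new tokens, so each scored token keeps near-full left context. A sketch with a dummy model (`nll_fn` stands in for a per-token negative log-likelihood; the exact windowing in the PR may differ):

```python
import math

def sliding_window_nll(tokens, window, stride, nll_fn):
    # Score tokens[begin:end] in stride-sized chunks, conditioning each
    # token on up to `window` tokens of left context.
    total, count = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + stride, len(tokens))
        ctx_start = max(0, end - window)
        for i in range(begin, end):
            total += nll_fn(tokens[ctx_start:i], tokens[i])
            count += 1
    return total / count                  # mean nats per token

# Dummy model: uniform over a 256-symbol vocab -> exactly 8 bits per token.
nll = sliding_window_nll(list(range(200)), window=128, stride=64,
                         nll_fn=lambda ctx, tok: math.log(256))
bpb = nll / math.log(2)                   # bits per byte at 1 byte per token
```

A smaller stride raises eval cost (more forward passes) but gives every token longer context, usually lowering measured bpb.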
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"shape":"cosine"}
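A warmdown schedule holds the learning rate constant and then decays it to zero over the final steps; with `shape: cosine` and 3,000 warmdown steps as configured here, a sketch (exact boundary handling in the PR may differ):

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_steps=3000):
    # Constant LR until the warmdown window, then cosine decay to zero.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```

Per the contributions list, this aggressive end-of-run decay (with a lower base LR) is credited with shrinking the quantization gap without quantization-aware training.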

Novel Contributions

  • First competitive non-transformer architecture in the competition
  • Hybrid model combining standard GQA attention and Mamba SSM in parallel within each block
  • Learned weighted average merging of attention and Mamba branch outputs
  • Fused input projection for K, V, and Mamba for GPU efficiency
  • Shallow models (7 layers) outperform deeper transformer baselines at a given compute budget
  • Training stability improvements (lower LR and aggressive cosine warmdown) reduce the quantization gap without quantization-aware training (QAT)
  • Minimal overhead from the Mamba branch in multi-GPU training
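The fused input projection mentioned above can be sketched as a single GEMM whose output is split into K, V, and the Mamba-branch input; one large matmul keeps the GPU busier than three small ones. Dimensions here are illustrative, not the PR's actual sizes:

```python
import numpy as np

def fused_input_projection(x, W_fused, d_k, d_v, d_ssm):
    # One matmul produces all three projections, then slices them apart.
    fused = x @ W_fused                        # (T, d_k + d_v + d_ssm)
    k = fused[:, :d_k]
    v = fused[:, d_k:d_k + d_v]
    m = fused[:, d_k + d_v:]
    return k, v, m

rng = np.random.default_rng(0)
T, d_model, d_k, d_v, d_ssm = 8, 32, 16, 16, 64
W = rng.normal(size=(d_model, d_k + d_v + d_ssm))
x = rng.normal(size=(T, d_model))
k, v, m = fused_input_projection(x, W, d_k, d_v, d_ssm)
```

Slicing a fused output is mathematically identical to running the three projections separately, so the fusion changes performance but not results.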