PR #1245
[Non-Record] Hymba-8L: Hybrid SSM + Sliding Window Attention with 32K Context (1.1470 BPB)
Status: open
by mkenney2
val_bpb
1.1470
Architecture
Hybrid
Optimizer
—
Artifact Size
15.7 MB
Training Techniques
Architecture
Mamba
Hybrid architecture combining Mamba SSM with sliding window attention in parallel within each block.
parameters: null
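The parallel SSM + attention fusion described above can be sketched structurally. The branch internals and the fusion rule are assumptions here; the submission may weight or normalize the branches differently:

```python
# Hedged structural sketch of one hybrid block: an SSM branch and a
# sliding-window-attention branch run in parallel on the same input,
# and their outputs are averaged into the residual stream. The branches
# are stand-in callables, not the submission's actual modules.
def hybrid_block(x, ssm_branch, attn_branch):
    return x + 0.5 * (ssm_branch(x) + attn_branch(x))
```

With identity branches, `hybrid_block(1.0, lambda t: t, lambda t: t)` doubles the input, which makes the residual-plus-averaged-branches structure easy to check.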
GQA
Grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
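A minimal sketch of grouped-query attention with the PR's head counts (8 query heads sharing 4 KV heads, so 2 query heads per KV head). The shapes and the causal mask are illustrative assumptions, not the submission's exact implementation:

```python
import numpy as np

# Hedged GQA sketch: each group of query heads attends against one
# shared KV head, halving KV cache size relative to full multi-head.
def gqa(q, k, v):
    # q: (heads, T, d); k, v: (kv_heads, T, d)
    heads, T, d = q.shape
    group = heads // k.shape[0]            # query heads per KV head (2 here)
    out = np.empty_like(q)
    causal = np.triu(np.full((T, T), -np.inf), k=1)  # t attends only to <= t
    for h in range(heads):
        kv = h // group                    # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d) + causal
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

At position 0 the causal mask leaves only self-attention, so every query head simply copies the first value vector of its shared KV head.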
RoPE
Uses rotary positional embeddings in the attention branch.
parameters: null
SmearGate
Uses SmearGate in the embedding/architecture stack.
parameters: null
BigramHash
Uses BigramHash embeddings.
parameters: null
U-Net skip connections
Adds U-Net style skip connections.
parameters: null
LeakyReLU
Applies a squared LeakyReLU activation (negative slope 0.9) in the MLP.
parameters: {"slope":0.9}
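One reading of "LeakyReLU, squared" is a ReLU²-style activation built on LeakyReLU; this minimal sketch assumes the square is applied elementwise to the LeakyReLU output:

```python
import numpy as np

# Assumed activation: LeakyReLU with negative slope 0.9, then squared
# elementwise (a LeakyReLU^2, by analogy with the common ReLU^2).
def leaky_relu_sq(x, slope=0.9):
    y = np.where(x >= 0, x, slope * x)
    return y * y
```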
weight tying
Untied embeddings: the lm_head uses separate weights instead of being tied to the token embeddings.
parameters: null
KV head count
Uses 4 KV heads.
parameters: {"kv_heads":4}
depth recurrence
Hybrid recurrent SSM branch with selective scan and causal convolution.
parameters: {"state_dim":4}
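A minimal diagonal selective-scan recurrence with state dimension 4, matching this entry's state_dim. In a real Mamba branch the per-step coefficients come from learned, input-dependent projections (plus a causal convolution); here they are passed in directly as an assumption:

```python
import numpy as np

# Hedged selective-scan sketch: h_t = a_t * h_{t-1} + b_t * x_t (elementwise
# over an N-dim state), y_t = c_t . h_t. With state_dim N = 4 as in the PR.
def selective_scan(x, a, b, c):
    # x: (T,); a, b, c: (T, N)
    T, N = a.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + b[t] * x[t]   # input-dependent state update
        y[t] = c[t] @ h              # readout
    return y
```

With a = 0, b = c = 1 the state is overwritten each step and the output is just N·x, a quick sanity check on the recurrence.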
Quantization
int8
bits: 8
scope: all
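Symmetric per-tensor quantization is the simplest reading of "int8, scope: all"; the submission's actual granularity (per-tensor vs. per-channel) is not stated here, so this is only a sketch:

```python
import numpy as np

# Hedged int8 weight quantization sketch: symmetric, per-tensor scale.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step, which is what lets an 8-layer model fit in the 15.7 MB artifact.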
Weight Averaging
EMA
parameters: {"decay":0.997}
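The EMA update with decay 0.997 can be sketched over a dict of parameters (floats stand in for weight tensors):

```python
# Hedged EMA weight-averaging sketch: after each step, the averaged
# weights move a fraction (1 - decay) toward the current weights.
def ema_update(ema, current, decay=0.997):
    return {k: decay * ema[k] + (1 - decay) * current[k] for k in ema}
```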
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"epochs":25,"freeze_blocks":0,"learning_rate":0.002}
Sequence Length
sequence_length
train_length: 32768
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":7000,"shape":"cosine"}
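A hedged sketch of a cosine-shaped warmdown: hold the base LR, then cosine-decay to zero over the final 7000 steps. The total step count and the constant hold phase are assumptions; only warmdown_steps and the cosine shape come from the entry:

```python
import math

# Hedged warmdown schedule sketch: constant LR until the final
# warmdown_steps, then a cosine decay from base_lr to zero.
def lr_at(step, total_steps, base_lr, warmdown_steps=7000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps   # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```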
Regularization
weight decay
parameters: {"weight_decay":0.15}
Other
other
Sliding window attention with a 1024-token window.
parameters: {"window":1024}
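The 1024-token window corresponds to a causal band mask: position t attends to positions max(0, t-1023)..t. A minimal sketch of the boolean mask (True = may attend):

```python
import numpy as np

# Hedged sliding-window mask sketch with the PR's 1024-token window.
def sliding_window_mask(T, window=1024):
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    return (j <= i) & (j > i - window)
```

This band structure is what keeps attention cost linear in sequence length at the 32K training context, with the SSM branch carrying longer-range state.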
Novel Contributions
- Hybrid Mamba + sliding window attention architecture
- 32K training context under the same compute budget
- 8-layer Hymba variant with improved BPB
- Expanded sliding window attention to 1024 tokens
- Reduced Mamba state dimension to 4
- Untied embeddings for better speed and quality
- High weight decay with int8 quantization to fit under 16 MB
- Long warmdown schedule and extended score-first TTT