PR #914

open

[Non-Record] Hymba-LongContext: 32K context training via hybrid SSM + SWA (1.1873 BPB)

by mkenney2
val_bpb
1.1873
Architecture
Hybrid
Optimizer
SGD
Artifact Size
14.3-14.6 MB

Training Techniques

Architecture
Mamba
Selective state space model branch for O(1) per-token recurrent sequence processing.
parameters: null
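The constant-cost claim behind the Mamba branch can be illustrated with a minimal diagonal linear recurrence. This is only a sketch: real Mamba makes the transition coefficients input-dependent ("selective"), while here `a` and `b` are fixed constants chosen for illustration.

```python
def ssm_scan(xs, a=0.9, b=0.5):
    """Diagonal linear state-space recurrence: h_t = a*h_{t-1} + b*x_t.

    Each token updates a fixed-size state h, so per-token cost is O(1)
    no matter how long the context grows. Coefficients are illustrative;
    Mamba's are computed from the input at each step.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x   # constant-time state update
        outputs.append(h)   # output read from the running state
    return outputs

ys = ssm_scan([1.0, 0.0, 0.0])  # impulse response decays geometrically
```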
sliding window attention
Attention limited to a fixed local window for constant per-token cost.
parameters: {"window_size":512}
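Sliding window attention can be sketched as an ordinary softmax over scores with everything outside the local window masked to -inf. The function below is a toy illustration (not the PR's code) and uses a window of 2 on 4 positions; the PR trains with `window_size=512`.

```python
import math

def sliding_window_attention(scores, window):
    """Mask scores so position i attends only to positions (i-window, i].

    scores[i][j] is the raw query-key score; masked entries become -inf
    and vanish under softmax. Per-token cost is O(window), independent
    of total sequence length.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if i - window < j <= i else float("-inf")
               for j in range(n)]
        m = max(row)                       # stabilize the softmax
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Toy example: uniform scores, 4 positions, window of 2.
probs = sliding_window_attention([[0.0] * 4 for _ in range(4)], window=2)
```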
GQA
Grouped query attention with shared KV heads.
parameters: {"heads":8,"kv_heads":4}
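The listed head counts imply that pairs of query heads share a KV head, halving the KV cache. A minimal sketch of the mapping (illustrative naming, not the PR's code):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Map a query head index to its shared KV head index.

    With 8 query heads and 4 KV heads (the PR's config), each group of
    8 // 4 = 2 consecutive query heads reads the same KV head, so the
    KV cache stores 4 heads instead of 8.
    """
    group_size = n_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size

groups = [kv_head_for(q) for q in range(8)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```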
RoPE
Rotary positional embeddings used in the attention branch.
parameters: null
QK-norm
Normalization applied to query/key representations.
parameters: null
LeakyReLU
LeakyReLU activation in the MLP.
parameters: {"slope":0.9}
U-Net skip connections
Skip connections across layers in a U-Net style.
parameters: null
SmearGate
SmearGate embedding/gating component.
parameters: null
BigramHash
Bigram hash embedding component.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
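The EMA entry amounts to a one-line update rule applied after each optimizer step; a minimal sketch using the PR's decay of 0.997 (list-of-floats weights are a stand-in for real parameter tensors):

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * w.

    decay=0.997 matches the PR's listed parameter. Evaluation then uses
    the averaged weights rather than the raw training weights.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]

ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0])  # slowly tracks the live weights
```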
Quantization
GPTQ-lite
bits: 6
scope: all
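As a rough illustration of what 6-bit quantization over all weights means, here is plain round-to-nearest symmetric quantization to the int6 range [-32, 31]. This is only a baseline sketch, not the PR's "GPTQ-lite": GPTQ-style methods additionally compensate rounding error using second-order statistics of the activations.

```python
def quantize_int6(weights):
    """Round-to-nearest symmetric quantization to 6-bit ints in [-32, 31].

    Returns the quantized values and the per-group scale needed to
    dequantize (w ~= q * scale).
    """
    scale = max(abs(w) for w in weights) / 31.0
    scale = scale or 1.0  # guard against an all-zero group
    qs = [max(-32, min(31, round(w / scale))) for w in weights]
    return qs, scale

qs, scale = quantize_int6([0.5, -0.25, 1.0])
```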
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"freeze_blocks":2}
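The TTT parameters describe SGD with momentum applied at evaluation time while the first two blocks stay frozen. A minimal sketch of that update step, with the PR's hyperparameters; all names are illustrative, and the "score-first" ordering of the adaptation data is outside this sketch:

```python
def ttt_sgd_step(blocks, grads, velocity, lr=0.002, momentum=0.9,
                 freeze_blocks=2):
    """One SGD-with-momentum step that skips the first `freeze_blocks`
    blocks, mirroring the PR's TTT config (lr=0.002, momentum=0.9,
    freeze_blocks=2). `blocks` is a list of per-block parameter lists.
    """
    for b, (params, g, v) in enumerate(zip(blocks, grads, velocity)):
        if b < freeze_blocks:
            continue  # frozen block: no update
        for i in range(len(params)):
            v[i] = momentum * v[i] + g[i]  # momentum buffer
            params[i] -= lr * v[i]         # parameter update
    return blocks

blocks = [[1.0], [1.0], [1.0]]
vel = [[0.0], [0.0], [0.0]]
ttt_sgd_step(blocks, [[1.0], [1.0], [1.0]], vel)
# Only the last (unfrozen) block moves.
```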
Sequence Length
sequence_length
train_length: 32768
eval_length: 524288
LR Schedule
cosine decay
parameters: {"warmdown_iters":3000}
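One plausible reading of a cosine schedule with `warmdown_iters: 3000` is a constant learning rate followed by cosine decay to zero over the final 3000 iterations; the PR's exact shape may differ, so treat this as an assumption:

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    """Constant LR, then cosine decay to zero over the last
    `warmdown_iters` steps (one interpretation of the PR's schedule).
    """
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters  # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))

lrs = [lr_at(s, 10000, 0.01) for s in (0, 7000, 8500, 10000)]
```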

Novel Contributions

  • Hybrid SSM + sliding window attention architecture enabling near-constant-cost long-context training
  • Training at 32,768-token context, far longer than the standard 1,024-token baseline
  • Demonstration that step time remains nearly constant from 8K to 64K context
  • Score-first test-time training to improve post-quantization validation BPB
  • Compact int6 + zstd artifact under the 16 MB limit