PR #914

open

[Non-Record] Hymba-LongContext: 32K context training via hybrid SSM + SWA (1.1873 BPB)

by mkenney2
val_bpb
1.1873
Architecture
Hybrid
Optimizer
SGD
Artifact Size
14.3-14.6 MB

Training Techniques

Architecture
Mamba
Selective state space model branch for O(1) per-token recurrent sequence processing.
parameters: null
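The constant-cost claim behind the Mamba branch can be illustrated with a minimal diagonal linear recurrence. This is only a sketch: real Mamba makes the transition coefficients input-dependent ("selective"), while here `a` and `b` are fixed constants chosen for illustration.

```python
def ssm_scan(xs, a=0.9, b=0.5):
    """Diagonal linear state-space recurrence: h_t = a*h_{t-1} + b*x_t.

    Each token updates a fixed-size state h, so per-token cost is O(1)
    no matter how long the context grows. Coefficients are illustrative;
    Mamba's are computed from the input at each step.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x   # constant-time state update
        outputs.append(h)   # output read from the running state
    return outputs

ys = ssm_scan([1.0, 0.0, 0.0])  # impulse response decays geometrically
```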
sliding window attention
Attention limited to a fixed local window for constant per-token cost.
parameters: {"window_size":512}
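Sliding window attention can be sketched as an ordinary softmax over scores with everything outside the local window masked to -inf. The function below is a toy illustration (not the PR's code) and uses a window of 2 on 4 positions; the PR trains with `window_size=512`.

```python
import math

def sliding_window_attention(scores, window):
    """Mask scores so position i attends only to positions (i-window, i].

    scores[i][j] is the raw query-key score; masked entries become -inf
    and vanish under softmax. Per-token cost is O(window), independent
    of total sequence length.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if i - window < j <= i else float("-inf")
               for j in range(n)]
        m = max(row)                       # stabilize the softmax
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Toy example: uniform scores, 4 positions, window of 2.
probs = sliding_window_attention([[0.0] * 4 for _ in range(4)], window=2)
```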
GQA
Grouped query attention with shared KV heads.
parameters: {"heads":8,"kv_heads":4}
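The listed head counts imply that pairs of query heads share a KV head, halving the KV cache. A minimal sketch of the mapping (illustrative naming, not the PR's code):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Map a query head index to its shared KV head index.

    With 8 query heads and 4 KV heads (the PR's config), each group of
    8 // 4 = 2 consecutive query heads reads the same KV head, so the
    KV cache stores 4 heads instead of 8.
    """
    group_size = n_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size

groups = [kv_head_for(q) for q in range(8)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```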
RoPE
Rotary positional embeddings used in the attention branch.
parameters: null
QK-norm
Normalization applied to query/key representations.
parameters: null
LeakyReLU
LeakyReLU activation in the MLP.
parameters: {"slope":0.9}
U-Net skip connections
Skip connections across layers in a U-Net style.
parameters: null
SmearGate
SmearGate embedding/gating component.
parameters: null
BigramHash
Bigram hash embedding component.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
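The EMA entry amounts to a one-line update rule applied after each optimizer step; a minimal sketch using the PR's decay of 0.997 (list-of-floats weights are a stand-in for real parameter tensors):

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * w.

    decay=0.997 matches the PR's listed parameter. Evaluation then uses
    the averaged weights rather than the raw training weights.
    """
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]

ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0])  # slowly tracks the live weights
```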
Quantization
GPTQ-lite
bits: 6
scope: all
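As a rough illustration of what 6-bit quantization over all weights means, here is plain round-to-nearest symmetric quantization to the int6 range [-32, 31]. This is only a baseline sketch, not the PR's "GPTQ-lite": GPTQ-style methods additionally compensate rounding error using second-order statistics of the activations.

```python
def quantize_int6(weights):
    """Round-to-nearest symmetric quantization to 6-bit ints in [-32, 31].

    Returns the quantized values and the per-group scale needed to
    dequantize (w ~= q * scale).
    """
    scale = max(abs(w) for w in weights) / 31.0
    scale = scale or 1.0  # guard against an all-zero group
    qs = [max(-32, min(31, round(w / scale))) for w in weights]
    return qs, scale

qs, scale = quantize_int6([0.5, -0.25, 1.0])
```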
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"freeze_blocks":2}
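The TTT parameters describe SGD with momentum applied at evaluation time while the first two blocks stay frozen. A minimal sketch of that update step, with the PR's hyperparameters; all names are illustrative, and the "score-first" ordering of the adaptation data is outside this sketch:

```python
def ttt_sgd_step(blocks, grads, velocity, lr=0.002, momentum=0.9,
                 freeze_blocks=2):
    """One SGD-with-momentum step that skips the first `freeze_blocks`
    blocks, mirroring the PR's TTT config (lr=0.002, momentum=0.9,
    freeze_blocks=2). `blocks` is a list of per-block parameter lists.
    """
    for b, (params, g, v) in enumerate(zip(blocks, grads, velocity)):
        if b < freeze_blocks:
            continue  # frozen block: no update
        for i in range(len(params)):
            v[i] = momentum * v[i] + g[i]  # momentum buffer
            params[i] -= lr * v[i]         # parameter update
    return blocks

blocks = [[1.0], [1.0], [1.0]]
vel = [[0.0], [0.0], [0.0]]
ttt_sgd_step(blocks, [[1.0], [1.0], [1.0]], vel)
# Only the last (unfrozen) block moves.
```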
Sequence Length
sequence_length
train_length: 32768
eval_length: 524288
LR Schedule
cosine decay
parameters: {"warmdown_iters":3000}
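One plausible reading of a cosine schedule with `warmdown_iters: 3000` is a constant learning rate followed by cosine decay to zero over the final 3000 iterations; the PR's exact shape may differ, so treat this as an assumption:

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    """Constant LR, then cosine decay to zero over the last
    `warmdown_iters` steps (one interpretation of the PR's schedule).
    """
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_iters  # 0 -> 1 across the warmdown
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))

lrs = [lr_at(s, 10000, 0.01) for s in (0, 7000, 8500, 10000)]
```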

Novel Contributions

  • Hybrid SSM + sliding window attention architecture enabling near-constant-cost long-context training
  • Training at 32,768-token context, far longer than the standard 1,024-token baseline
  • Demonstration that step time remains nearly constant from 8K to 64K context
  • Score-first test-time training to improve post-quantization validation BPB
  • Compact int6 + zstd artifact under the 16 MB limit