PR #1013
Non-record: S4D-Lin SSM Hybrid — Fixing Why Mamba Failed in Parameter…
by himanshudongre
val_bpb
1.1682
Architecture
Hybrid
Optimizer
—
Artifact Size
13.0 MB
Training Techniques
Architecture
SSM
Replaced the lower transformer layers with S4D-Lin state-space model blocks, implemented as a causal depthwise conv1d whose kernels are learned exponentially decaying functions.
parameters: {"layers":2,"kernel_size":64}
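The S4D-Lin block above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the module name, parameterization (one learned decay rate and scale per channel, with softplus keeping the decay positive), and shapes are all assumptions; only the core idea — a causal depthwise `F.conv1d` with an exponentially decaying kernel — comes from the description.

```python
import torch
import torch.nn.functional as F


class S4DLinConvBlock(torch.nn.Module):
    """Hypothetical S4D-Lin-style block: causal depthwise conv1d whose
    kernel is an exponentially decaying function with learned decay rate
    and scale per channel."""

    def __init__(self, dim, kernel_size=64):
        super().__init__()
        self.kernel_size = kernel_size
        # one learned decay rate and output scale per channel
        self.log_decay = torch.nn.Parameter(torch.randn(dim) * 0.5)
        self.scale = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):  # x: (batch, seq_len, dim)
        b, t, d = x.shape
        # kernel[c, k] = scale[c] * exp(-softplus(log_decay[c]) * k)
        k = torch.arange(self.kernel_size, device=x.device, dtype=x.dtype)
        decay = F.softplus(self.log_decay)                              # (dim,)
        kernel = self.scale[:, None] * torch.exp(-decay[:, None] * k)   # (dim, K)
        xt = x.transpose(1, 2)                                          # (b, dim, t)
        # left-pad so each output position only sees past inputs (causal)
        xt = F.pad(xt, (self.kernel_size - 1, 0))
        # depthwise conv: weight shape (dim, 1, K), groups=dim
        y = F.conv1d(xt, kernel[:, None, :], groups=d)
        return y.transpose(1, 2)                                        # (b, t, dim)
```

Because the whole forward pass is ordinary tensor ops plus `F.conv1d`, it traces cleanly under `torch.compile`, which is the property the PR relies on in place of a custom Mamba selective-scan kernel.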
XSA
Used standard XSA attention in the upper layers of the hybrid model.
parameters: {"layers":9}
LeakyReLU
Standard LeakyReLU^2 MLP used in the transformer stack.
parameters: null
resid mix
Used x0-mixing / residual mixing in the block design.
parameters: null
Quantization
GPTQ
bits: 5
scope: full model
Compression
lzma
level: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
LN scale
parameters: {"ln_scale_factor":true}
Novel Contributions
- First functional SSM in Parameter Golf without throughput penalty
- S4D-Lin SSM blocks implemented with standard F.conv1d instead of Mamba selective scan
- Hybrid architecture combining lower-layer SSM blocks with upper-layer transformer attention
- Demonstrated that throughput can match the baseline while remaining torch.compile compatible
- Identified that attention outperforms SSM in lower layers at full competition scale
- Showed that SSM weights are sensitive to GPTQ int5 quantization
- Added ssm_proj handling to the quantization tensor classes
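The hybrid layout described above (SSM blocks in the lower layers, standard attention in the upper layers) can be sketched as below. This is an assumption-laden illustration: the layer counts mirror the PR (2 SSM, 9 attention), but the stand-in SSM block (a plain causal depthwise conv), the use of `nn.MultiheadAttention` rather than the PR's XSA attention, and the plain residual connections are simplifications for clarity.

```python
import torch
import torch.nn as nn


class CausalDepthwiseSSM(nn.Module):
    """Stand-in for the S4D-Lin block: a causal depthwise conv1d."""

    def __init__(self, dim, kernel_size=64):
        super().__init__()
        # padding = K-1 left-pads (and right-pads); we trim the tail for causality
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)

    def forward(self, x):  # x: (batch, seq_len, dim)
        y = self.conv(x.transpose(1, 2))[..., : x.shape[1]]
        return y.transpose(1, 2)


class HybridStack(nn.Module):
    """Hypothetical hybrid stack: n_ssm SSM blocks below n_attn causal
    attention blocks, each wrapped in a residual connection."""

    def __init__(self, dim=64, n_ssm=2, n_attn=9, n_heads=4):
        super().__init__()
        self.lower = nn.ModuleList(
            CausalDepthwiseSSM(dim) for _ in range(n_ssm))
        self.upper = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_attn))

    def forward(self, x):  # x: (batch, seq_len, dim)
        for blk in self.lower:
            x = x + blk(x)
        t = x.shape[1]
        # boolean upper-triangular mask blocks attention to future positions
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        for attn in self.upper:
            out, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
            x = x + out
        return x
```

The design choice the contributions list points at is that the conv-based SSM handles cheap local mixing in the lower layers while attention does global token mixing above, keeping throughput at baseline since no custom scan kernel is needed.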