PR #1013

open

Non-record: S4D-Lin SSM Hybrid — Fixing Why Mamba Failed in Parameter…

by himanshudongre
val_bpb
1.1682
Architecture
Hybrid
Optimizer
Artifact Size
13.0 MB

Training Techniques

Architecture
SSM
Replaced the lower transformer layers with S4D-Lin state-space model blocks, implemented as a causal depthwise conv1d with learned exponentially decaying kernels.
parameters: {"layers":2,"kernel_size":64}
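The description above (a causal depthwise conv1d whose per-channel kernels decay exponentially at a learned rate) could be sketched as follows. This is a minimal illustration, not the submission's code; the class name, parameterization, and shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class S4DLinConvBlock(nn.Module):
    """Hypothetical sketch of an S4D-Lin-style block: each channel is
    filtered by a causal kernel k[t] = amp * exp(-decay * t) with a
    learned, per-channel decay rate, applied via standard F.conv1d."""

    def __init__(self, dim, kernel_size=64):
        super().__init__()
        self.kernel_size = kernel_size
        # learned per-channel log decay rate and output amplitude
        self.log_decay = nn.Parameter(torch.zeros(dim))
        self.amp = nn.Parameter(torch.ones(dim))

    def forward(self, x):  # x: (batch, seq, dim)
        x = x.transpose(1, 2)                      # (batch, dim, seq)
        t = torch.arange(self.kernel_size, device=x.device, dtype=x.dtype)
        decay = F.softplus(self.log_decay)         # keep rates positive
        # kernel[c, t] = amp[c] * exp(-decay[c] * t); t=0 is the newest tap
        k = self.amp[:, None] * torch.exp(-decay[:, None] * t)
        k = k.flip(-1).unsqueeze(1)                # (dim, 1, K), causal order
        # left-pad so output at position i only sees positions <= i
        x = F.pad(x, (self.kernel_size - 1, 0))
        y = F.conv1d(x, k, groups=k.shape[0])      # depthwise convolution
        return y.transpose(1, 2)                   # back to (batch, seq, dim)
```

Because the whole forward pass is a single `F.conv1d` over precomputed kernels (no selective scan or custom kernel), it stays `torch.compile`-friendly, consistent with the throughput claim below.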
XSA
Used standard XSA attention in the upper layers of the hybrid model.
parameters: {"layers":9}
LeakyReLU
Standard LeakyReLU^2 MLP used in the transformer stack.
parameters: null
resid mix
Used x0-mixing / residual mixing in the block design.
parameters: null
Quantization
GPTQ
bits: 5
scope: full model
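For orientation, the 5-bit grid that GPTQ maps weights onto can be illustrated with plain per-channel round-to-nearest quantization. Note this is only a sketch of the numeric format: GPTQ proper additionally compensates rounding error using second-order (Hessian) information, which this example omits.

```python
import torch

def quantize_rtn(w, bits=5):
    """Per-output-channel symmetric round-to-nearest quantization.
    Shows the int5 grid only; real GPTQ adjusts remaining weights
    after each column is rounded."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                     # avoid divide-by-zero
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.float() * scale
```

The "sensitive for SSM weights" finding below is plausible under this scheme: the learned decay parameters span a wide dynamic range per channel, so a 5-bit grid loses proportionally more precision there than in ordinary linear layers.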
Compression
lzma
level: null
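A minimal sketch of how an artifact might be LZMA-compressed for size scoring, assuming the checkpoint is serialized with `torch.save` first; the function names and the preset (unspecified above, shown here as 9) are assumptions.

```python
import io
import lzma
import torch

def save_compressed(state_dict, path):
    """Serialize a state dict to bytes, then compress with LZMA.
    preset=9 is an assumption; the submission leaves the level null."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with lzma.open(path, "wb", preset=9) as f:
        f.write(buf.getvalue())

def load_compressed(path):
    """Decompress and deserialize the state dict."""
    with lzma.open(path, "rb") as f:
        return torch.load(io.BytesIO(f.read()))
```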
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
LN scale
parameters: {"ln_scale_factor":true}

Novel Contributions

  • First functional SSM in Parameter Golf without throughput penalty
  • S4D-Lin SSM blocks implemented with standard F.conv1d instead of Mamba selective scan
  • Hybrid architecture combining lower-layer SSM blocks with upper-layer transformer attention
  • Demonstrated that throughput can match the baseline while remaining torch.compile compatible
  • Identified that attention outperforms SSM in lower layers at full competition scale
  • Showed that SSM weights are sensitive to GPTQ int5 quantization
  • Added ssm_proj handling to the quantization tensor classes
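The hybrid layout listed above (2 lower SSM layers under 9 upper attention layers) could be stacked as in the sketch below. The block classes here are simple stand-ins for the submission's S4D-Lin and XSA blocks, and all names are assumptions.

```python
import torch
import torch.nn as nn

class ConvSSMBlock(nn.Module):
    """Stand-in for an S4D-Lin block: causal depthwise conv + residual."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)

    def forward(self, x):  # (batch, seq, dim)
        y = self.conv(x.transpose(1, 2))[..., : x.shape[1]]  # trim right pad
        return x + y.transpose(1, 2)

class AttnBlock(nn.Module):
    """Stand-in for an XSA block: causal self-attention + residual."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):
        n = x.shape[1]
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        y, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return x + y

class HybridStack(nn.Module):
    """SSM-style blocks in the lower layers, attention blocks above,
    mirroring the {"layers": 2} SSM / {"layers": 9} XSA split."""
    def __init__(self, dim, ssm_layers=2, attn_layers=9, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ConvSSMBlock(dim) for _ in range(ssm_layers)]
            + [AttnBlock(dim, n_heads) for _ in range(attn_layers)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x
```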