PR #1013
Non-record: S4D-Lin SSM Hybrid — Fixing Why Mamba Failed in Parameter…
by himanshudongre
val_bpb
1.1682
Architecture
Hybrid
Optimizer
—
Artifact Size
13.0 MB
Training Techniques
Architecture
SSM
Replaced the lower transformer layers with S4D-Lin state-space model blocks, implemented as a causal depthwise conv1d whose kernels are learned exponentially decaying functions.
parameters: {"layers":2,"kernel_size":64}
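The S4D-Lin block above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the module name, parameterization (one learned decay rate and scale per channel, with softplus keeping the decay positive), and shapes are all assumptions; only the core idea — a causal depthwise `F.conv1d` with an exponentially decaying kernel — comes from the description.

```python
import torch
import torch.nn.functional as F


class S4DLinConvBlock(torch.nn.Module):
    """Hypothetical S4D-Lin-style block: causal depthwise conv1d whose
    kernel is an exponentially decaying function with learned decay rate
    and scale per channel."""

    def __init__(self, dim, kernel_size=64):
        super().__init__()
        self.kernel_size = kernel_size
        # one learned decay rate and output scale per channel
        self.log_decay = torch.nn.Parameter(torch.randn(dim) * 0.5)
        self.scale = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):  # x: (batch, seq_len, dim)
        b, t, d = x.shape
        # kernel[c, k] = scale[c] * exp(-softplus(log_decay[c]) * k)
        k = torch.arange(self.kernel_size, device=x.device, dtype=x.dtype)
        decay = F.softplus(self.log_decay)                              # (dim,)
        kernel = self.scale[:, None] * torch.exp(-decay[:, None] * k)   # (dim, K)
        xt = x.transpose(1, 2)                                          # (b, dim, t)
        # left-pad so each output position only sees past inputs (causal)
        xt = F.pad(xt, (self.kernel_size - 1, 0))
        # depthwise conv: weight shape (dim, 1, K), groups=dim
        y = F.conv1d(xt, kernel[:, None, :], groups=d)
        return y.transpose(1, 2)                                        # (b, t, dim)
```

Because the whole forward pass is ordinary tensor ops plus `F.conv1d`, it traces cleanly under `torch.compile`, which is the property the PR relies on in place of a custom Mamba selective-scan kernel.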
XSA
Used standard XSA attention in the upper layers of the hybrid model.
parameters: {"layers":9}
LeakyReLU
Standard LeakyReLU^2 MLP used in the transformer stack.
parameters: null
resid mix
Used x0-mixing / residual mixing in the block design.
parameters: null
Quantization
GPTQ
bits: 5
scope: full model
Compression
lzma
level: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Regularization
LN scale
parameters: {"ln_scale_factor":true}
Novel Contributions
- First functional SSM in Parameter Golf without throughput penalty
- S4D-Lin SSM blocks implemented with standard F.conv1d instead of Mamba selective scan
- Hybrid architecture combining lower-layer SSM blocks with upper-layer transformer attention
- Demonstrated that throughput can match the baseline while remaining torch.compile compatible
- Identified that attention outperforms SSM in lower layers at full competition scale
- Showed that SSM weights are sensitive to GPTQ int5 quantization
- Added ssm_proj handling to the quantization tensor classes
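The hybrid layout described above (SSM blocks in the lower layers, standard attention in the upper layers) can be sketched as below. This is an assumption-laden illustration: the layer counts mirror the PR (2 SSM, 9 attention), but the stand-in SSM block (a plain causal depthwise conv), the use of `nn.MultiheadAttention` rather than the PR's XSA attention, and the plain residual connections are simplifications for clarity.

```python
import torch
import torch.nn as nn


class CausalDepthwiseSSM(nn.Module):
    """Stand-in for the S4D-Lin block: a causal depthwise conv1d."""

    def __init__(self, dim, kernel_size=64):
        super().__init__()
        # padding = K-1 left-pads (and right-pads); we trim the tail for causality
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)

    def forward(self, x):  # x: (batch, seq_len, dim)
        y = self.conv(x.transpose(1, 2))[..., : x.shape[1]]
        return y.transpose(1, 2)


class HybridStack(nn.Module):
    """Hypothetical hybrid stack: n_ssm SSM blocks below n_attn causal
    attention blocks, each wrapped in a residual connection."""

    def __init__(self, dim=64, n_ssm=2, n_attn=9, n_heads=4):
        super().__init__()
        self.lower = nn.ModuleList(
            CausalDepthwiseSSM(dim) for _ in range(n_ssm))
        self.upper = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_attn))

    def forward(self, x):  # x: (batch, seq_len, dim)
        for blk in self.lower:
            x = x + blk(x)
        t = x.shape[1]
        # boolean upper-triangular mask blocks attention to future positions
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        for attn in self.upper:
            out, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
            x = x + out
        return x
```

The design choice the contributions list points at is that the conv-based SSM handles cheap local mixing in the lower layers while attention does global token mixing above, keeping throughput at baseline since no custom scan kernel is needed.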