PR #2066
Non-record: Mamba2 SSM + Attention Hybrid (SP8192) - val_bpb=1.1005 prequant, research preview over 56 SSM runs, limitations and findings
by SarooshKhan897
val_bpb
1.1005
Architecture
Hybrid
Optimizer
Muon
Artifact Size
16,094,692 bytes
Training Techniques
Architecture
Hybrid
Mamba2/SSM backbone with a single attention block in a 10-layer model
parameters: {"layers":10,"attention_layers":[6],"embed_dim":512,"d_state":64,"headdim":64}
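A minimal sketch of the layer interleaving described above, assuming PyTorch and the `mamba_ssm` package's `Mamba2` block; the block names, pre-norm residual layout, head count, and the omission of FFN sublayers are illustrative rather than taken from the submission:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba2  # assumed dependency; the PR does not name its SSM implementation


class AttentionBlock(nn.Module):
    """Causal self-attention sublayer with a pre-norm residual."""

    def __init__(self, embed_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        t = x.size(1)
        # Upper-triangular -inf mask enforces causality.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        return x + out


class SSMBlock(nn.Module):
    """Mamba2 mixer sublayer with the same pre-norm residual layout."""

    def __init__(self, embed_dim: int, d_state: int, headdim: int):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.mixer = Mamba2(d_model=embed_dim, d_state=d_state, headdim=headdim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))


def build_hybrid(layers: int = 10, attention_layers=(6,), embed_dim: int = 512,
                 d_state: int = 64, headdim: int = 64) -> nn.Sequential:
    """Stack of `layers` blocks where only the listed indices use attention."""
    blocks = [
        AttentionBlock(embed_dim) if i in attention_layers
        else SSMBlock(embed_dim, d_state, headdim)
        for i in range(layers)
    ]
    return nn.Sequential(*blocks)
```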
BigramHash
Bigram embedding/hash enabled in the submitted configuration
parameters: null
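Since the parameters field is null, the following is a hypothetical sketch of a hashed bigram embedding: the (previous, current) token pair is hashed into a fixed-size table and the looked-up vector is added to the ordinary token embedding. Table size, hash function, and padding choice are all assumptions:

```python
import torch
import torch.nn as nn


class HashedBigramEmbedding(nn.Module):
    """Embedding looked up by a hash of the (previous, current) token pair."""

    def __init__(self, table_size: int = 1 << 18, embed_dim: int = 512, seed: int = 1_000_003):
        super().__init__()
        self.table = nn.Embedding(table_size, embed_dim)
        self.table_size = table_size
        self.seed = seed

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        # Previous token for position 0 is padded with 0.
        prev = torch.cat([torch.zeros_like(tokens[:, :1]), tokens[:, :-1]], dim=1)
        # Simple multiplicative hash of the bigram into the fixed table.
        idx = (prev * self.seed + tokens) % self.table_size
        return self.table(idx)


# Usage: x = token_embedding(tokens) + bigram_embedding(tokens)
```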
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_on":"large 2D matrices"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_on":["scalar/control groups","embed","attention","Mamba projections in some ablations"]}
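A rough sketch of the parameter routing implied by the optimizer notes, assuming PyTorch plus an external `Muon` implementation; the `ndim == 2` heuristic, the learning rates, and the exact group membership are illustrative, not the submission's actual split:

```python
import torch
from torch import nn


def split_param_groups(model: nn.Module):
    """Route large 2D matrices to Muon and everything else to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Heuristic from the PR notes: Muon on large 2D matrices; AdamW on
        # embeddings, scalars/control groups, and (in some ablations)
        # attention and Mamba projections.
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params


# muon_params, adamw_params = split_param_groups(model)
# opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4)   # hyperparameters illustrative
# opt_muon = Muon(muon_params, lr=0.02, momentum=0.95)   # assumes a Muon implementation is installed
```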
Quantization
mixed int4/int5/int6
bits: null
scope: FFN gate/down, Mamba, attention, embeddings
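The PR states only the bit widths and scopes, so this sketch assumes a simple symmetric per-tensor scheme; the group-to-bit-width mapping and the rounding/clamping details are illustrative, not the submission's actual quantizer:

```python
import torch


def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization of weights to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4, 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                 # sub-byte packing happens separately, before compression


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale


# Illustrative bit assignment per group; the PR does not publish its exact mapping.
BITS_PER_GROUP = {"ffn_gate": 4, "ffn_down": 4, "mamba": 5, "attention": 6, "embedding": 6}
```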
Compression
lzma
level: 9
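Packing aside, the compression step maps directly onto Python's standard `lzma` module at preset 9; the file name is illustrative:

```python
import lzma


def compress_artifact(raw: bytes) -> bytes:
    """Compress the packed, quantized weight bytes with LZMA at preset 9."""
    return lzma.compress(raw, preset=9)


# with open("artifact.bin", "wb") as f:
#     f.write(compress_artifact(packed_weights))
```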
Evaluation
sliding window eval
parameters: {"stride":64}
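A sketch of sliding-window evaluation with stride 64, assuming a byte-level vocabulary so bits per token equals bits per byte; scoring only the newest stride of tokens in each window follows the usual sliding-window perplexity recipe rather than the submission's exact code, and `model(tokens)` is assumed to return logits:

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.LongTensor, window: int = 8192, stride: int = 64) -> float:
    """Bits per byte where each stride of new tokens is scored with up to
    `window - stride` tokens of preceding context."""
    total_bits, total_tokens = 0.0, 0
    for end in range(stride, tokens.size(0) + 1, stride):
        start = max(0, end - window)
        ctx = tokens[start:end].unsqueeze(0)       # (1, <= window)
        logits = model(ctx)                        # (1, T, vocab)
        n_new = min(stride, ctx.size(1) - 1)
        # Score only the last n_new positions so each token is counted exactly once.
        nll = F.cross_entropy(logits[0, -n_new - 1:-1], ctx[0, -n_new:], reduction="sum")
        total_bits += nll.item() / math.log(2)
        total_tokens += n_new
    return total_bits / max(total_tokens, 1)
```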
Test-Time Training
LoRA TTT
parameters: {"enabled":false}
Weight Averaging
EMA
parameters: {"decay":0.9965}
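A minimal EMA tracker using the stated decay of 0.9965; the update cadence (every optimizer step) and the deep-copied shadow model are assumptions:

```python
import copy

import torch
from torch import nn


class EMA:
    """Exponential moving average of model weights, decay = 0.9965 per the PR."""

    def __init__(self, model: nn.Module, decay: float = 0.9965):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module):
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


# ema = EMA(model)
# after each optimizer step: ema.update(model)
# evaluate or export ema.shadow at the end of training
```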
Novel Contributions
- Documents a Mamba2 SSM + attention hybrid trained at SP8192 with strong pre-quant validation performance
- Shows that SSM-heavy hybrids can achieve competitive full-precision BPB but remain limited by artifact compression
- Provides a research chronicle of many SSM/attention hybrid ablations, including carrier-style SSMs, depth recurrence, and optimizer comparisons
- Demonstrates that aggressive low-bit quantization preserves quality reasonably well, though the compressed artifact still misses the 16 MB cap
- Highlights Muon-heavy optimization as stronger than pure AdamW for this setup