PR #1829
Add non-record submission: HybridSSM SparseAttention MixedQuant
Status: open · by estesryan
val_bpb: 1.2047
Architecture: Hybrid
Optimizer: —
Artifact Size: 15.57 MB
Training Techniques
Architecture: Hybrid
Hybrid Transformer–SSM architecture with final-layer SSM replacement and selective attention removal in intermediate blocks.
parameters: {"layers":8,"model_dim":512,"num_heads":8,"num_kv_heads":4,"ssm_layers":7,"ssm_state_dim":128,"ssm_num_groups":8,"no_attn_layers":[3,6]}
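A minimal sketch of the per-layer layout these parameters might imply. Interpreting `"ssm_layers": 7` as the index of the final, SSM-substituted layer and labeling the attention-free intermediate blocks `"mlp_only"` are assumptions; the submission only states final-layer SSM replacement and attention removal at layers 3 and 6.

```python
# Derive a per-layer block layout from the submission's architecture
# parameters (interpretation of "ssm_layers" as a layer index is an
# assumption based on the stated final-layer SSM replacement).
params = {
    "layers": 8, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4,
    "ssm_layers": 7, "ssm_state_dim": 128, "ssm_num_groups": 8,
    "no_attn_layers": [3, 6],
}

def layer_layout(p):
    layout = []
    for i in range(p["layers"]):
        if i == p["ssm_layers"]:        # final layer replaced by an SSM block
            layout.append("ssm")
        elif i in p["no_attn_layers"]:  # intermediate blocks with attention removed
            layout.append("mlp_only")
        else:
            layout.append("attn")
    return layout

print(layer_layout(params))
# ['attn', 'attn', 'attn', 'mlp_only', 'attn', 'attn', 'mlp_only', 'ssm']
```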
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
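With 8 query heads and 4 KV heads, each KV head is shared by two query heads. A minimal sketch of that mapping (the contiguous grouping of query heads per KV head is the conventional GQA layout, assumed here):

```python
# Grouped-query attention head mapping: each KV head serves
# num_heads // num_kv_heads consecutive query heads.
num_heads, num_kv_heads = 8, 4
group_size = num_heads // num_kv_heads  # 2 query heads per KV head

kv_for_query = [q // group_size for q in range(num_heads)]
print(kv_for_query)  # [0, 0, 1, 1, 2, 2, 3, 3]
```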
MLP3x
Heterogeneous feedforward allocation with different MLP expansion ratios for transformer and SSM blocks.
parameters: {"transformer_mlp_mult":3.25,"ssm_mlp_mult":2}
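At model_dim = 512 these ratios translate into concrete feedforward hidden sizes; a quick check (truncating to an integer is an assumption, since the submission gives only the multipliers):

```python
# Feedforward hidden sizes implied by the heterogeneous MLP multipliers.
model_dim = 512
transformer_mlp_mult, ssm_mlp_mult = 3.25, 2

transformer_hidden = int(model_dim * transformer_mlp_mult)  # 1664
ssm_hidden = int(model_dim * ssm_mlp_mult)                  # 1024
```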
Quantization: mixed int6/int8
bits: 6
scope: export
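A hedged sketch of what quantizing a weight tensor to a given bit width could look like for such an export. The per-tensor symmetric scheme below is an assumption; the submission specifies only a mixed int6/int8 export, not the quantization scheme.

```python
# Symmetric per-tensor quantization sketch (scheme is an assumption).
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, -0.07]
q6, s6 = quantize(w, 6)     # int6 path; bits=8 would give the int8 path
w_hat = dequantize(q6, s6)  # reconstruction error bounded by scale / 2
```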
Compression: zstd
level: null
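The compressed roundtrip validation mentioned in the contributions amounts to compressing the exported artifact, decompressing it, and confirming the bytes match. A sketch of that check, using zlib as a stand-in because Python's standard library has no zstd bindings (the submission itself uses zstd):

```python
# Compressed roundtrip validation sketch (zlib stands in for zstd).
import hashlib
import zlib

artifact = bytes(range(256)) * 64  # stand-in for the exported model bytes

compressed = zlib.compress(artifact, level=9)
restored = zlib.decompress(compressed)

# Verify the decompressed bytes are identical to the original artifact.
assert hashlib.sha256(restored).hexdigest() == hashlib.sha256(artifact).hexdigest()
```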
Sequence Length
train_length: 3072
eval_length: null
LR Schedule: warmdown
parameters: {"warmdown_frac":0.3}
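A minimal sketch of a warmdown schedule under the common interpretation: hold the base learning rate, then decay to zero over the final warmdown_frac of training. The linear decay shape is an assumption; the submission gives only warmdown_frac = 0.3.

```python
# Warmdown LR schedule sketch: constant, then linear decay to zero
# over the last warmdown_frac of total steps (shape is an assumption).
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3):
    start = total_steps * (1 - warmdown_frac)
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```

For example, with 1000 total steps the rate stays at base_lr through step 700 and reaches zero at step 1000.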
Novel Contributions
- Hybrid Transformer–SSM architecture
- Final-layer SSM substitution
- Selective attention removal in intermediate blocks
- Heterogeneous feedforward allocation between Transformer and SSM blocks
- Mixed int6/int8 post-training quantized export
- Compressed roundtrip validation on the final artifact