PR #1829

open

Add non-record submission: HybridSSM SparseAttention MixedQuant

by estesryan
val_bpb: 1.2047
Architecture: Hybrid
Optimizer:
Artifact Size: 15.57 MB

Training Techniques

Architecture
Hybrid
Hybrid Transformer–SSM architecture with final-layer SSM replacement and selective attention removal in intermediate blocks.
parameters: {"layers":8,"model_dim":512,"num_heads":8,"num_kv_heads":4,"ssm_layers":7,"ssm_state_dim":128,"ssm_num_groups":8,"no_attn_layers":[3,6]}
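
A minimal sketch of how the listed layout parameters could map to a per-layer block plan. Reading "ssm_layers": 7 as the 0-indexed position of the SSM-replaced final layer and "no_attn_layers" as intermediate blocks run without attention is an assumption drawn from the description above, not confirmed by the PR.

```python
# Illustrative only: derive a per-layer plan from the submission's layout parameters.
params = {
    "layers": 8,
    "ssm_layers": 7,           # assumed: 0-indexed position of the SSM-replaced final layer
    "no_attn_layers": [3, 6],  # assumed: intermediate blocks with attention removed
}

def layer_plan(p):
    plan = []
    for i in range(p["layers"]):
        if i == p["ssm_layers"]:
            plan.append("ssm")        # final-layer SSM replacement
        elif i in p["no_attn_layers"]:
            plan.append("mlp_only")   # selective attention removal
        else:
            plan.append("attn_mlp")   # standard transformer block
    return plan

print(layer_plan(params))
# ['attn_mlp', 'attn_mlp', 'attn_mlp', 'mlp_only', 'attn_mlp', 'attn_mlp', 'mlp_only', 'ssm']
```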
GQA
Uses grouped-query attention with fewer key/value heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
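
A minimal GQA sketch (not the submission's code) showing how 8 query heads can share 4 key/value heads by repeating each K/V head across a group of 2 query heads; tensor shapes are illustrative.

```python
# Illustrative grouped-query attention: expand KV heads to match query heads.
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (batch, num_heads, seq, head_dim); k, v: (batch, num_kv_heads, seq, head_dim)
    group = num_heads // num_kv_heads
    k = k.repeat_interleave(group, dim=1)  # each KV head serves a group of query heads
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

b, t, head_dim = 2, 16, 64
q = torch.randn(b, 8, t, head_dim)
k = torch.randn(b, 4, t, head_dim)
v = torch.randn(b, 4, t, head_dim)
out = gqa_attention(q, k, v)  # (2, 8, 16, 64)
```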
MLP3x
Heterogeneous feedforward allocation with different MLP expansion ratios for transformer and SSM blocks.
parameters: {"transformer_mlp_mult":3.25,"ssm_mlp_mult":2}
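
For concreteness, the expansion ratios above imply different hidden widths for the two block types at model_dim = 512; the flooring used here is an assumption.

```python
# Feedforward hidden widths implied by the listed multipliers (rounding is assumed).
model_dim = 512
transformer_mlp_mult = 3.25
ssm_mlp_mult = 2

transformer_hidden = int(model_dim * transformer_mlp_mult)  # 1664
ssm_hidden = int(model_dim * ssm_mlp_mult)                  # 1024
print(transformer_hidden, ssm_hidden)
```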
Quantization
mixed int6/int8
bits: 6
scope: export
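
A sketch of symmetric per-tensor post-training quantization to a given bit width, as might be applied at export for a mixed int6/int8 scheme. Which tensors receive 6 versus 8 bits is not stated in the PR, and the per-tensor symmetric scheme here is an assumption, not the submission's method.

```python
# Illustrative symmetric per-tensor quantization to n bits (int6 values stored in int8).
import numpy as np

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8, 31 for int6
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q8, s8 = quantize(w, 8)
q6, s6 = quantize(w, 6)
print(np.abs(w - dequantize(q8, s8)).max(), np.abs(w - dequantize(q6, s6)).max())
```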
Compression
zstd
level: null
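
A sketch of a zstd compress/decompress roundtrip check on an exported artifact, using the `zstandard` Python package at its default level (level: null above). The file path is hypothetical and byte-for-byte hash equality is an assumed validation criterion.

```python
# Illustrative zstd roundtrip validation on a final artifact.
import hashlib
import zstandard as zstd

def roundtrip_ok(path):
    data = open(path, "rb").read()
    compressed = zstd.ZstdCompressor().compress(data)      # default compression level
    restored = zstd.ZstdDecompressor().decompress(compressed)
    return hashlib.sha256(data).digest() == hashlib.sha256(restored).digest()

print(roundtrip_ok("model_export.bin"))  # hypothetical artifact path
```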
Sequence Length
sequence_length
train_length: 3072
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.3}
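
A sketch of a warmdown schedule under the assumption that the learning rate is held constant and then decayed linearly to zero over the final 30% of training steps; base_lr and total_steps are illustrative values, not taken from the PR.

```python
# Illustrative warmdown schedule: constant LR, then linear decay over the last warmdown_frac of steps.
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3):
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)

total_steps, base_lr = 10_000, 3e-4
print(warmdown_lr(0, total_steps, base_lr))       # constant phase: 0.0003
print(warmdown_lr(8_500, total_steps, base_lr))   # halfway through warmdown: 0.00015
print(warmdown_lr(10_000, total_steps, base_lr))  # end of training: 0.0
```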

Novel Contributions

  • Hybrid Transformer–SSM architecture
  • Final-layer SSM substitution
  • Selective attention removal in intermediate blocks
  • Heterogeneous feedforward allocation between Transformer and SSM blocks
  • Mixed int6/int8 post-training quantized export
  • Compressed roundtrip validation on the final artifact