PR #1829
Add non-record submission: HybridSSM SparseAttention MixedQuant
Status: open · by estesryan
val_bpb: 1.2047
Architecture: Hybrid
Optimizer: —
Artifact Size: 15.57 MB
Training Techniques
Architecture: Hybrid
Hybrid Transformer–SSM architecture with final-layer SSM replacement and selective attention removal in intermediate blocks.
parameters: {"layers":8,"model_dim":512,"num_heads":8,"num_kv_heads":4,"ssm_layers":7,"ssm_state_dim":128,"ssm_num_groups":8,"no_attn_layers":[3,6]}
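A minimal sketch of the per-layer layout these parameters might imply. Interpreting `"ssm_layers": 7` as the index of the final, SSM-substituted layer and labeling the attention-free intermediate blocks `"mlp_only"` are assumptions; the submission only states final-layer SSM replacement and attention removal at layers 3 and 6.

```python
# Derive a per-layer block layout from the submission's architecture
# parameters (interpretation of "ssm_layers" as a layer index is an
# assumption based on the stated final-layer SSM replacement).
params = {
    "layers": 8, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4,
    "ssm_layers": 7, "ssm_state_dim": 128, "ssm_num_groups": 8,
    "no_attn_layers": [3, 6],
}

def layer_layout(p):
    layout = []
    for i in range(p["layers"]):
        if i == p["ssm_layers"]:        # final layer replaced by an SSM block
            layout.append("ssm")
        elif i in p["no_attn_layers"]:  # intermediate blocks with attention removed
            layout.append("mlp_only")
        else:
            layout.append("attn")
    return layout

print(layer_layout(params))
# ['attn', 'attn', 'attn', 'mlp_only', 'attn', 'attn', 'mlp_only', 'ssm']
```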
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
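With 8 query heads and 4 KV heads, each KV head is shared by two query heads. A minimal sketch of that mapping (the contiguous grouping of query heads per KV head is the conventional GQA layout, assumed here):

```python
# Grouped-query attention head mapping: each KV head serves
# num_heads // num_kv_heads consecutive query heads.
num_heads, num_kv_heads = 8, 4
group_size = num_heads // num_kv_heads  # 2 query heads per KV head

kv_for_query = [q // group_size for q in range(num_heads)]
print(kv_for_query)  # [0, 0, 1, 1, 2, 2, 3, 3]
```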
MLP3x
Heterogeneous feedforward allocation with different MLP expansion ratios for transformer and SSM blocks.
parameters: {"transformer_mlp_mult":3.25,"ssm_mlp_mult":2}
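At model_dim = 512 these ratios translate into concrete feedforward hidden sizes; a quick check (truncating to an integer is an assumption, since the submission gives only the multipliers):

```python
# Feedforward hidden sizes implied by the heterogeneous MLP multipliers.
model_dim = 512
transformer_mlp_mult, ssm_mlp_mult = 3.25, 2

transformer_hidden = int(model_dim * transformer_mlp_mult)  # 1664
ssm_hidden = int(model_dim * ssm_mlp_mult)                  # 1024
```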
Quantization: mixed int6/int8
bits: 6
scope: export
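A hedged sketch of what quantizing a weight tensor to a given bit width could look like for such an export. The per-tensor symmetric scheme below is an assumption; the submission specifies only a mixed int6/int8 export, not the quantization scheme.

```python
# Symmetric per-tensor quantization sketch (scheme is an assumption).
def quantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6, 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, -0.07]
q6, s6 = quantize(w, 6)     # int6 path; bits=8 would give the int8 path
w_hat = dequantize(q6, s6)  # reconstruction error bounded by scale / 2
```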
Compression: zstd
level: null
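The compressed roundtrip validation mentioned in the contributions amounts to compressing the exported artifact, decompressing it, and confirming the bytes match. A sketch of that check, using zlib as a stand-in because Python's standard library has no zstd bindings (the submission itself uses zstd):

```python
# Compressed roundtrip validation sketch (zlib stands in for zstd).
import hashlib
import zlib

artifact = bytes(range(256)) * 64  # stand-in for the exported model bytes

compressed = zlib.compress(artifact, level=9)
restored = zlib.decompress(compressed)

# Verify the decompressed bytes are identical to the original artifact.
assert hashlib.sha256(restored).hexdigest() == hashlib.sha256(artifact).hexdigest()
```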
Sequence Length
train_length: 3072
eval_length: null
LR Schedule: warmdown
parameters: {"warmdown_frac":0.3}
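A minimal sketch of a warmdown schedule under the common interpretation: hold the base learning rate, then decay to zero over the final warmdown_frac of training. The linear decay shape is an assumption; the submission gives only warmdown_frac = 0.3.

```python
# Warmdown LR schedule sketch: constant, then linear decay to zero
# over the last warmdown_frac of total steps (shape is an assumption).
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3):
    start = total_steps * (1 - warmdown_frac)
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```

For example, with 1000 total steps the rate stays at base_lr through step 700 and reaches zero at step 1000.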
Novel Contributions
- Hybrid Transformer–SSM architecture
- Final-layer SSM substitution
- Selective attention removal in intermediate blocks
- Heterogeneous feedforward allocation between Transformer and SSM blocks
- Mixed int6/int8 post-training quantized export
- Compressed roundtrip validation on the final artifact