Mamba

Architecture
Used in: 10 PRs
Best BPB: 1.1470
Avg BPB: 1.3068

Hyperparameters Across PRs

pr_number  parameters
914
1107       {"layers":8,"mamba_layers":7,"attention_layers":1,"dim":512,"d_state":64,"mlp_mult":3,"seq_len":4096}
1245
1342       {"layers":12,"d_model":512,"d_inner":1024,"d_state":64,"d_conv":4,"headdim":64}
1355       {"layers":7,"dim":512,"d_state":64,"seq_len":4096}
1524       {"layer":5,"d_model":512,"d_state":64,"d_conv":4,"expand":2}
1525       {"layer":5,"d_model":512,"d_state":64,"d_conv":4,"expand":2}
1574       {"d_model":640,"d_inner":1280,"d_state":34,"d_conv":4,"num_layers":8,"head_adapter_rank":16,"vocab_size":1056}
1643       {"layers":7,"attn_layers":2,"dim":512,"d_state":64,"expand":2,"headdim":64,"chunk_size":64,"mlp_mult":3}
1757       {"outer_layers":1,"layers":9,"encoder_layers":1,"main_layers":7,"decoder_layers":1,"kv_heads":4,"heads":8,"dim":512}
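The rows above use different key names for what look like the same settings (for example `layers`, `layer`, and `num_layers`, or `dim` and `d_model`), which makes them awkward to compare directly. The Python sketch below shows one possible way to bring a few of the rows onto common keys. The alias map and the `normalize` helper are hypothetical and rest on the assumption that those keys are equivalent; the page itself does not confirm this. It also assumes the usual Mamba convention `d_inner = expand * d_model` to recover `expand` where only `d_inner` is listed.

```python
import json

# A few rows copied verbatim from the table above.
# PRs 914 and 1245 list no parameters, so they are omitted here.
ROWS = {
    1342: '{"layers":12,"d_model":512,"d_inner":1024,"d_state":64,"d_conv":4,"headdim":64}',
    1524: '{"layer":5,"d_model":512,"d_state":64,"d_conv":4,"expand":2}',
    1574: '{"d_model":640,"d_inner":1280,"d_state":34,"d_conv":4,"num_layers":8,'
          '"head_adapter_rank":16,"vocab_size":1056}',
    1643: '{"layers":7,"attn_layers":2,"dim":512,"d_state":64,"expand":2,'
          '"headdim":64,"chunk_size":64,"mlp_mult":3}',
}

# Hypothetical alias map: an ASSUMPTION that these keys name the same setting.
ALIASES = {"layer": "layers", "num_layers": "layers", "dim": "d_model"}


def normalize(raw: str) -> dict:
    """Parse one JSON row and rename aliased keys to a common name."""
    params = json.loads(raw)
    out = {ALIASES.get(key, key): value for key, value in params.items()}
    # Assumption: d_inner = expand * d_model, so expand can be derived
    # when a row lists d_inner but not expand.
    if "expand" not in out and "d_inner" in out and "d_model" in out:
        out["expand"] = out["d_inner"] / out["d_model"]
    return out


NORMALIZED = {pr: normalize(raw) for pr, raw in ROWS.items()}

# Print a compact side-by-side view of the shared settings.
for pr, cfg in sorted(NORMALIZED.items()):
    print(pr, cfg.get("layers"), cfg.get("d_model"), cfg.get("d_state"), cfg.get("expand"))
```

Keys that have no alias (such as `headdim` or `chunk_size`) pass through unchanged, so nothing from the original row is dropped.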