PR #2155
open
feat(non_record): add SP8192 BPE Mamba3 SSM hybrid 16MB non-record submission
by divagr18
val_bpb
1.2542
Architecture
Hybrid
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Architecture
Hybrid
Hybrid architecture that replaces every 4th transformer attention block with a Mamba3 state-space model (SSM) layer, giving 2 SSM blocks out of 9 layers.
parameters: {"layers":9,"heads":8,"kv_heads":4,"model_dim":448,"ssm_every_n":4,"ssm_blocks":2}
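To make the layer pattern concrete, here is a minimal sketch of how the block layout could be derived from `layers` and `ssm_every_n`. The function name `block_layout` and the rule "1-based positions that are multiples of `ssm_every_n` become SSM blocks" are assumptions for illustration; the PR may place the SSM layers differently. Under this assumption the count happens to match `ssm_blocks: 2`.

```python
def block_layout(layers: int, ssm_every_n: int) -> list[str]:
    """Label each layer as attention or SSM.

    Assumption (not confirmed by the PR): the layers whose 1-based
    position is a multiple of ssm_every_n are the SSM layers.
    """
    return [
        "mamba3" if (i + 1) % ssm_every_n == 0 else "attention"
        for i in range(layers)
    ]


layout = block_layout(layers=9, ssm_every_n=4)
ssm_positions = [i for i, kind in enumerate(layout) if kind == "mamba3"]
print(layout)
print("SSM layers at 0-based indices:", ssm_positions)
```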
weight tying
Tied input and output embeddings.
parameters: null
GQA
Uses grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
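The following sketch shows what 8 query heads sharing 4 KV heads implies for projection shapes. It assumes the attention head size is `model_dim / heads` (448 / 8 = 56), which the PR summary does not state, and that consecutive query heads share a KV head; `gqa_shapes` is a hypothetical helper, not a function from the submission.

```python
def gqa_shapes(model_dim: int, heads: int, kv_heads: int) -> dict:
    """Compute projection shapes and the query-to-KV head mapping for GQA."""
    if model_dim % heads or heads % kv_heads:
        raise ValueError("model_dim must divide by heads, heads by kv_heads")
    head_dim = model_dim // heads   # assumed: per-head width of attention
    group = heads // kv_heads       # query heads sharing one KV head
    return {
        "head_dim": head_dim,
        "q_proj": (model_dim, heads * head_dim),
        "kv_proj": (model_dim, kv_heads * head_dim),  # one each for K and V
        "kv_head_for_query": [q // group for q in range(heads)],
    }


shapes = gqa_shapes(model_dim=448, heads=8, kv_heads=4)
print(shapes)
```

With these values each K and V projection is half the width of the query projection, which is where GQA saves parameters relative to full multi-head attention.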
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalar_params":true}
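One common way to realize `adam_for_scalar_params` is to route parameters by tensor rank. The sketch below assumes that 2-D weights go to Muon and everything else goes to Adam; the parameter names are hypothetical, and how the submission routes the tied embedding matrix is not stated here, so it is left out of the example.

```python
def split_params(shapes: dict[str, tuple]) -> tuple[list[str], list[str]]:
    """Partition parameter names into (muon, adam) groups by rank.

    Assumption: 2-D matrices use Muon; scalars and vectors use Adam.
    """
    muon, adam = [], []
    for name, shape in shapes.items():
        (muon if len(shape) == 2 else adam).append(name)
    return muon, adam


# Hypothetical parameter names and shapes, for illustration only.
example_shapes = {
    "blocks.0.attn.q_proj.weight": (448, 448),
    "blocks.0.attn.k_proj.weight": (224, 448),
    "blocks.0.norm.scale": (448,),
    "blocks.0.attn.gain": (),
}
muon_names, adam_names = split_params(example_shapes)
print("Muon:", muon_names)
print("Adam:", adam_names)
```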
Weight Averaging
SWA
parameters: null
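Since no SWA parameters are recorded, the sketch below only illustrates the general idea: an equal-weight running mean over weight snapshots. The snapshot values, the flat-list weight representation, and the name `swa_update` are illustrative assumptions, not details from the PR.

```python
def swa_update(avg: list[float], new: list[float], n_averaged: int) -> list[float]:
    """Fold one more snapshot into a running mean of n_averaged snapshots."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]


# Three made-up weight snapshots taken at different training steps.
snapshots = [[1.0, 4.0], [3.0, 8.0], [5.0, 0.0]]

swa_weights = list(snapshots[0])
count = 1
for snap in snapshots[1:]:
    swa_weights = swa_update(swa_weights, snap, count)
    count += 1

print(swa_weights)  # element-wise mean of the three snapshots
```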
Quantization
GPTQ
bits: 8
scope: all
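To show what storing weights at 8 bits involves, here is a minimal sketch of symmetric per-row int8 quantization using plain round-to-nearest. This is deliberately not GPTQ: GPTQ additionally adjusts the remaining weights using second-order information to compensate for rounding error, which is omitted here. The function names and sample values are assumptions for illustration.

```python
def quantize_int8(row: list[float]) -> tuple[list[int], float]:
    """Round-to-nearest symmetric int8 quantization of one weight row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(row)
restored = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(codes, scale, max_err)
```

The int8 codes plus one scale per row are what would then be passed to a general-purpose compressor such as zstd.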
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride_frac":0.5}
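A sliding-window evaluation with `stride_frac` 0.5 can be sketched as follows. Because `eval_length` is not recorded, the window is assumed to equal the training length of 1024 tokens, giving a stride of 512; the helper name and the rule that each window scores only tokens not scored by an earlier window are also assumptions.

```python
def sliding_windows(n_tokens: int, window: int, stride_frac: float):
    """Return (start, end, score_from) spans covering n_tokens.

    Each window sees tokens [start, end) as context but contributes
    loss only for [score_from, end), so every token is scored once.
    """
    stride = max(1, int(window * stride_frac))
    spans = []
    start = 0
    scored_upto = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(n_tokens=2560, window=1024, stride_frac=0.5)
for span in spans:
    print(span)
```

After the first window, every scored token has at least half a window of preceding context, which is the usual motivation for overlapping windows.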
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Uses the official Mamba3 CUDA extension as the state-space model implementation.
parameters: {"impl":"mamba3","head_dim":64,"d_state":128,"expand":2,"mimo_rank":4}
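As a rough sanity check on these settings, the sketch below derives the SSM inner width and head count. It assumes the Mamba-2 style convention that the inner width is `expand * model_dim` and that heads are `head_dim` wide; whether Mamba3 in this submission follows the same convention is not confirmed by the PR summary, and `d_state` and `mimo_rank` are not modeled.

```python
def ssm_inner_dims(model_dim: int, expand: int, head_dim: int) -> tuple[int, int]:
    """Return (d_inner, n_heads) under the assumed Mamba-2 style convention."""
    d_inner = expand * model_dim
    if d_inner % head_dim != 0:
        raise ValueError("expand * model_dim must be divisible by head_dim")
    return d_inner, d_inner // head_dim


d_inner, n_ssm_heads = ssm_inner_dims(model_dim=448, expand=2, head_dim=64)
print(d_inner, n_ssm_heads)
```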
Novel Contributions
- Hybrid Mamba3 SSM + attention architecture for a non-record 16MB submission
- Replaces every 4th transformer attention block with Mamba3 to reduce parameter count
- Uses SP8192 SentencePiece BPE tokenizer and exported dataset setup
- Combines Muon for matrix parameters with Adam for scalar parameters
- Applies SWA and GPTQ int8 quantization with zstd compression
- Includes official Mamba CUDA extension integration for efficient SSM layers