PR #2155

open

feat(non_record): add SP8192 BPE Mamba3 SSM hybrid 16MB non-record submission

by divagr18

val_bpb

1.2542

Architecture

Hybrid

Optimizer

Muon

Artifact Size

16MB

Training Techniques

Architecture

Hybrid

Hybrid architecture replacing every 4th transformer attention block with a Mamba3 state-space model layer.

parameters: {"layers":9,"heads":8,"kv_heads":4,"model_dim":448,"ssm_every_n":4,"ssm_blocks":2}
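
The layer layout implied by these parameters can be sketched as follows. This is a minimal illustration, not the PR's code: `layer_kinds` is a hypothetical helper, and it assumes layers are counted from 1, so that "every 4th" means positions 4 and 8. Under that assumption, 9 layers with `ssm_every_n` of 4 yield exactly the 2 SSM blocks listed in `ssm_blocks`.

```python
# Hypothetical helper: the PR's actual layer-construction code is not shown on this page.
def layer_kinds(num_layers: int, ssm_every_n: int) -> list[str]:
    """Return the block type per layer; positions n, 2n, ... (1-indexed) use the SSM."""
    return [
        "mamba3" if (i + 1) % ssm_every_n == 0 else "attention"
        for i in range(num_layers)
    ]


# Values taken from the parameters line above.
kinds = layer_kinds(num_layers=9, ssm_every_n=4)

# 1-indexed positions of the SSM layers (assumed indexing convention).
ssm_positions = [i + 1 for i, kind in enumerate(kinds) if kind == "mamba3"]

print(kinds)
print("SSM layers at positions:", ssm_positions)
```

If the submission counts layers from 0 or applies an offset, the positions shift but the count of SSM blocks stays the same for this configuration.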

weight tying

Tied input and output embeddings.
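
A rough sketch of what tying saves in parameter count. It assumes a vocabulary of 8192 entries (inferred from the "SP8192" tokenizer name, not stated explicitly on this page) and uses the `model_dim` of 448 from the architecture parameters; `embedding_params` is a hypothetical helper.

```python
VOCAB_SIZE = 8192  # assumption: inferred from the "SP8192" tokenizer name
MODEL_DIM = 448    # from the architecture parameters


def embedding_params(vocab_size: int, model_dim: int, tied: bool) -> int:
    """Parameters used by the input embedding plus the output projection."""
    table = vocab_size * model_dim
    # With tying, the output projection reuses the input embedding table.
    return table if tied else 2 * table


tied = embedding_params(VOCAB_SIZE, MODEL_DIM, tied=True)
untied = embedding_params(VOCAB_SIZE, MODEL_DIM, tied=False)
saved = untied - tied

print(f"tied: {tied:,}  untied: {untied:,}  saved: {saved:,}")
```

Under these assumptions, tying removes one full vocabulary-by-dimension matrix, which matters when the whole artifact must fit in 16MB.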

parameters: null

GQA

Uses grouped query attention with 8 attention heads and 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
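
The head grouping can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: it assumes `head_dim = model_dim / heads` with the `model_dim` of 448 from the architecture parameters, bias-free projections, and consecutive query heads sharing a KV head. `gqa_layout` is a hypothetical helper.

```python
def gqa_layout(model_dim: int, heads: int, kv_heads: int) -> dict:
    """Describe head grouping and projection sizes for grouped query attention."""
    assert model_dim % heads == 0, "model_dim must divide evenly across heads"
    assert heads % kv_heads == 0, "query heads must divide evenly across KV heads"
    head_dim = model_dim // heads
    group_size = heads // kv_heads
    # Query head h reads keys and values from KV head h // group_size.
    kv_of_query = [h // group_size for h in range(heads)]
    q_params = model_dim * heads * head_dim
    kv_params = 2 * model_dim * kv_heads * head_dim  # K and V projections
    o_params = heads * head_dim * model_dim
    return {
        "head_dim": head_dim,
        "group_size": group_size,
        "kv_of_query": kv_of_query,
        "attn_params": q_params + kv_params + o_params,
    }


gqa = gqa_layout(model_dim=448, heads=8, kv_heads=4)
mha = gqa_layout(model_dim=448, heads=8, kv_heads=8)  # full multi-head baseline

print(gqa)
print("params saved per attention block vs. MHA:", mha["attn_params"] - gqa["attn_params"])
```

Halving the KV heads halves the key and value projections, which is one of the ways a small model can trade a little attention capacity for parameter budget.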

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"adam_for_scalar_params":true}
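
One way the Muon/Adam split might be expressed is by routing parameters on their shape. The sketch below is an assumption about how such a split could look, not the PR's code: the parameter names and shapes are hypothetical, and real setups often make further exceptions (for example, handling embedding tables separately).

```python
def split_params(shapes: dict[str, tuple[int, ...]]) -> tuple[list[str], list[str]]:
    """Route 2-D weight matrices to Muon and everything else to Adam."""
    muon, adam = [], []
    for name, shape in shapes.items():
        # Muon orthogonalizes matrix-shaped updates, so it only applies to 2-D weights.
        (muon if len(shape) == 2 else adam).append(name)
    return muon, adam


# Hypothetical parameter names and shapes, for illustration only.
example_shapes = {
    "blocks.0.attn.q_proj.weight": (448, 448),
    "blocks.0.attn.k_proj.weight": (224, 448),
    "blocks.0.norm.weight": (448,),
    "blocks.0.attn.gain": (),
}

muon_names, adam_names = split_params(example_shapes)
print("Muon:", muon_names)
print("Adam:", adam_names)
```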

Weight Averaging

SWA

parameters: null
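
Since no SWA parameters are listed, the following is only a generic sketch of the technique: an incremental mean over weight snapshots. The snapshot values are made up, and the schedule (when averaging starts, how often snapshots are taken) is not given on this page.

```python
def swa_update(avg: list[float], new: list[float], n_averaged: int) -> list[float]:
    """Incremental mean: avg <- avg + (new - avg) / (n_averaged + 1)."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]


# Made-up weight snapshots standing in for checkpoints late in training.
snapshots = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]

swa_weights = list(snapshots[0])
for n_averaged, snapshot in enumerate(snapshots[1:], start=1):
    swa_weights = swa_update(swa_weights, snapshot, n_averaged)

print(swa_weights)  # element-wise mean of the three snapshots
```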

Quantization

GPTQ

bits: 8

scope: all
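
To make the 8-bit storage format concrete, here is a sketch of plain symmetric round-to-nearest int8 quantization with one scale per row. This is deliberately not GPTQ: GPTQ additionally uses second-order information from calibration data to compensate rounding error, and that step is omitted. The per-row scale and the helper names are assumptions for illustration.

```python
def quantize_row_int8(row: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest int8 quantization with a single scale per row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so the range stays symmetric around zero.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q: list[int], scale: float) -> list[float]:
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_row_int8(row)
restored = dequantize_row(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print(codes, scale, max_error)
```

Each weight is stored in one byte plus a small per-row overhead for the scale, and the rounding error per weight is bounded by half a quantization step.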

Compression

zstd

level: null

Evaluation

sliding window eval

parameters: {"stride_frac":0.5}
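
A sketch of how a `stride_frac` of 0.5 might translate into evaluation windows. It assumes the window equals the training length of 1024 (the eval length is listed as null) and that each token is scored exactly once, in the first window that reaches it; the submission's actual scoring rule may differ. `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_tokens: int, window: int, stride_frac: float):
    """Return (ctx_start, ctx_end, score_start, score_end) half-open spans."""
    stride = max(1, int(window * stride_frac))
    spans = []
    start = 0
    scored_upto = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        # The model sees [start, end) but only tokens not yet scored are counted.
        spans.append((start, end, scored_upto, end))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(num_tokens=2560, window=1024, stride_frac=0.5)
for span in spans:
    print(span)
```

With these settings every window after the first advances by 512 tokens, so each newly scored token has at least 512 tokens of preceding context.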

Sequence Length

sequence_length

train_length: 1024

eval_length: null

Other

other

Uses the official Mamba3 CUDA extension as the state-space model implementation.

parameters: {"impl":"mamba3","head_dim":64,"d_state":128,"expand":2,"mimo_rank":4}
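
The dimensions these settings imply can be estimated with simple arithmetic. The sketch below assumes the Mamba-2 style convention (inner width equals `expand` times the model dimension, split into heads of `head_dim`), carries over the `model_dim` of 448 from the architecture parameters, and does not model `mimo_rank`. Whether the Mamba3 extension uses exactly this convention is an assumption; `ssm_dims` is a hypothetical helper.

```python
def ssm_dims(model_dim: int, expand: int, head_dim: int, d_state: int) -> dict:
    """Derived sizes under the assumed Mamba-2 style convention."""
    d_inner = expand * model_dim
    assert d_inner % head_dim == 0, "inner width must divide evenly into heads"
    n_heads = d_inner // head_dim
    # Each head carries a (head_dim x d_state) recurrent state per sequence.
    state_elems = n_heads * head_dim * d_state
    return {"d_inner": d_inner, "n_heads": n_heads, "state_elems": state_elems}


dims = ssm_dims(model_dim=448, expand=2, head_dim=64, d_state=128)
print(dims)
```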

Novel Contributions

Hybrid Mamba3 SSM + attention architecture for a non-record 16MB submission
Replaces every 4th transformer attention block with Mamba3 to reduce parameter count
Uses SP8192 SentencePiece BPE tokenizer and exported dataset setup
Combines Muon for matrix parameters with Adam for scalar parameters
Applies SWA and GPTQ int8 quantization with zstd compression
Integrates the official Mamba3 CUDA extension for efficient SSM layers