PR #2154
closed
feat(records): add SP8192 BPE Mamba3 SSM hybrid 16MB non-record submission
by divagr18
val_bpb
1.2542
Architecture
Hybrid
Optimizer
Muon
Artifact Size
16MB
Training Techniques
Architecture
GQA
Hybrid transformer architecture replacing every 4th attention block with a Mamba3 state-space model layer
parameters: {"layers":9,"heads":8,"kv_heads":4,"model_dim":448,"ssm_every_n":4}
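As a rough illustration of what these parameters imply, the sketch below derives the per-block mixer layout and the grouped-query attention shapes. The indexing convention (blocks counted from 1, every 4th one swapped) is an assumption; the submission's training script may count differently.

```python
# Values copied from the parameters line above.
cfg = {"layers": 9, "heads": 8, "kv_heads": 4, "model_dim": 448, "ssm_every_n": 4}


def layer_kinds(layers, ssm_every_n):
    """Return the mixer type for each block.

    Assumption: blocks are counted from 1 and every block whose position is a
    multiple of ssm_every_n uses the SSM mixer instead of attention.
    """
    return ["ssm" if (i + 1) % ssm_every_n == 0 else "attn" for i in range(layers)]


kinds = layer_kinds(cfg["layers"], cfg["ssm_every_n"])

# Grouped-query attention: query heads share a smaller set of key/value heads.
attn_head_dim = cfg["model_dim"] // cfg["heads"]   # 448 // 8
queries_per_kv = cfg["heads"] // cfg["kv_heads"]   # 8 // 4
kv_proj_width = cfg["kv_heads"] * attn_head_dim    # width of the K (or V) projection

print(kinds)
print(attn_head_dim, queries_per_kv, kv_proj_width)
```

Under this convention, 2 of the 9 blocks use the SSM mixer and the remaining 7 keep attention, with two query heads sharing each key/value head.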
Mamba
Mamba3 SSM used as a drop-in mixer replacement for selected blocks
parameters: {"expand":2,"d_state":128,"head_dim":64,"mimo_rank":4}
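The SSM parameters can be related to the model width with a little arithmetic. This sketch assumes the Mamba-2-style convention where the inner width is `expand * model_dim` and is split into heads of size `head_dim`; whether Mamba3 derives its shapes the same way is not confirmed by this summary.

```python
# Values copied from the parameters line above; model_dim comes from the GQA entry.
mamba = {"expand": 2, "d_state": 128, "head_dim": 64, "mimo_rank": 4}
model_dim = 448

# Assumed convention (Mamba-2 style): inner width = expand * model_dim.
d_inner = mamba["expand"] * model_dim

# The inner width must divide evenly into SSM heads.
if d_inner % mamba["head_dim"] != 0:
    raise ValueError("d_inner is not a multiple of head_dim")
n_ssm_heads = d_inner // mamba["head_dim"]

print(d_inner, n_ssm_heads)
```

Note that the SSM `head_dim` of 64 is independent of the attention head size implied by `model_dim / heads`.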
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalar_params":true}
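The `adam_for_scalar_params` flag suggests the parameters are split between two optimizers. A minimal sketch of one common way to do that split is shown below, assuming the rule is rank-based (matrices to Muon, everything else to Adam). The function name, parameter names, and shapes are hypothetical and not taken from the submission.

```python
def split_params(shapes):
    """Partition parameter names by tensor rank.

    Assumption: tensors with 2 or more dimensions go to Muon, while scalars
    and vectors (gains, biases, norm weights) go to Adam.
    """
    muon, adam = [], []
    for name, shape in shapes.items():
        (muon if len(shape) >= 2 else adam).append(name)
    return muon, adam


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "blocks.0.attn.q_proj.weight": (448, 448),
    "blocks.0.attn.k_proj.weight": (224, 448),
    "blocks.0.norm.weight": (448,),
    "blocks.0.resid_scale": (),
}
muon_names, adam_names = split_params(shapes)
print(muon_names)
print(adam_names)
```

In a real training script each list would be passed to its own optimizer instance; embedding tables are often routed to Adam as well, which this sketch does not model.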
Weight Averaging
SWA
parameters: null
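Since no SWA parameters are recorded, the following is only a generic sketch of stochastic weight averaging as a running uniform mean over checkpoints. The checkpoint values are made up, and the submission's averaging schedule (start step, frequency) is unknown.

```python
def swa_update(avg, new, n_averaged):
    """Fold one more checkpoint into a running uniform average.

    avg and new map parameter names to values; n_averaged is how many
    checkpoints avg already contains.
    """
    return {k: avg[k] + (new[k] - avg[k]) / (n_averaged + 1) for k in avg}


# Hypothetical scalar "weights" standing in for full tensors.
checkpoints = [{"w": 1.0}, {"w": 2.0}, {"w": 6.0}]

avg = dict(checkpoints[0])
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)

print(avg)
```

The incremental form avoids keeping every checkpoint in memory, which matters when the averaged object is a full model state.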
Quantization
GPTQ
bits: 8
scope: model
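To make the 8-bit storage format concrete, the sketch below shows plain symmetric round-to-nearest int8 quantization with one scale per row. This is deliberately not GPTQ: GPTQ additionally uses second-order (Hessian-based) information to compensate rounding error column by column, which is omitted here. The per-row scaling is also an assumption.

```python
def quantize_row_int8(row):
    """Symmetric round-to-nearest int8 with a single scale for the row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] to guard against floating-point overshoot.
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


row = [0.5, -1.27, 0.0, 0.64]  # hypothetical weights
q, scale = quantize_row_int8(row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_err)
```

With round-to-nearest, the reconstruction error per weight is bounded by half the scale; GPTQ aims to reduce the resulting error in layer outputs rather than in individual weights.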
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride_frac":0.5}
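A sliding-window evaluation with `stride_frac` 0.5 can be sketched as follows: each window advances by half its length, and only tokens not already scored by an earlier window contribute to the loss. The window length of 1024 is an assumption (the recorded `eval_length` is null, so the training length is used here), and the function name is hypothetical.

```python
def sliding_windows(n_tokens, window, stride_frac):
    """Return (start, end, score_from) spans that score every token once.

    Tokens in [start, score_from) serve only as context; tokens in
    [score_from, end) are the ones whose loss is counted.
    """
    stride = max(1, int(window * stride_frac))
    spans = []
    start, scored_upto = 0, 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(n_tokens=2048, window=1024, stride_frac=0.5)
scored = sum(end - score_from for _, end, score_from in spans)
print(spans)
print(scored)
```

After the first window, every scored token sees at least half a window of preceding context, at the cost of roughly twice as many forward passes as non-overlapping evaluation.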
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
SentencePiece BPE tokenizer with 8192 vocabulary size
parameters: {"vocab_size":8192}
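The tokenizer choice matters for the headline metric because, assuming `val_bpb` denotes validation bits per byte, the per-token loss is normalized by the byte length of the text rather than by token count. A minimal sketch of that conversion, with made-up inputs:

```python
import math


def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte."""
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * (n_tokens / n_bytes)


# Hypothetical numbers: 4 bits per token, 4 bytes per token on average.
bpb = bits_per_byte(mean_loss_nats=4.0 * math.log(2), n_tokens=1000, n_bytes=4000)
print(bpb)
```

A larger vocabulary usually means fewer tokens per byte, so the same per-token loss translates into a lower bits-per-byte figure.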
Novel Contributions
- Non-record 16MB submission built around a SentencePiece BPE tokenizer with an 8192-token vocabulary (SP8192)
- Hybrid architecture replacing every 4th transformer attention block with Mamba3 SSM
- Use of official Mamba CUDA extension for efficient state-space layers
- Muon + Adam optimizer combination with SWA
- GPTQ int8 quantization with zstd compression
- Detailed packaged run with training script, log, metadata, dependencies, and tokenizer assets