PR #2154

closed

feat(records): add SP8192 BPE Mamba3 SSM hybrid 16MB non-record submission

by divagr18
val_bpb: 1.2542
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 16MB

Training Techniques

Architecture
GQA
Hybrid transformer architecture replacing every 4th attention block with a Mamba3 state-space model layer
parameters: {"layers":9,"heads":8,"kv_heads":4,"model_dim":448,"ssm_every_n":4}
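A minimal sketch of how the listed configuration could map to a per-block mixer schedule. The placement rule is an assumption: the card says only "every 4th attention block", so the sketch counts blocks from 1 and swaps blocks 4 and 8. The helper name `mixer_schedule` is hypothetical and not taken from the submission.

```python
# Hypothetical helper: the source gives the config but not the exact placement rule.
def mixer_schedule(layers: int, ssm_every_n: int) -> list[str]:
    """Return the mixer type per block, assuming blocks are counted from 1
    and every `ssm_every_n`-th block swaps attention for the SSM mixer."""
    return ["ssm" if (i + 1) % ssm_every_n == 0 else "attn" for i in range(layers)]


# Values copied from the parameters line above.
config = {"layers": 9, "heads": 8, "kv_heads": 4, "model_dim": 448, "ssm_every_n": 4}

schedule = mixer_schedule(config["layers"], config["ssm_every_n"])

# Grouped-query attention arithmetic implied by the config.
attn_head_dim = config["model_dim"] // config["heads"]  # 448 / 8
queries_per_kv = config["heads"] // config["kv_heads"]  # 8 / 4

print(schedule)
print(attn_head_dim, queries_per_kv)
```

Under this assumption, 2 of the 9 blocks use the SSM mixer and 7 keep grouped-query attention, with each key/value head shared by two query heads.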
Mamba
Mamba3 SSM used as a drop-in mixer replacement for selected blocks
parameters: {"expand":2,"d_state":128,"head_dim":64,"mimo_rank":4}
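The SSM parameters can be turned into layer widths with a little arithmetic. This sketch assumes the convention used by earlier Mamba releases, where the inner width is `expand * model_dim` and the number of SSM heads is the inner width divided by `head_dim`. Whether the Mamba3 layer in this submission follows the same convention is not stated in the card.

```python
# Values copied from the card; the derivation rule is an assumption
# borrowed from the Mamba / Mamba-2 convention.
model_dim = 448
mamba = {"expand": 2, "d_state": 128, "head_dim": 64, "mimo_rank": 4}

d_inner = mamba["expand"] * model_dim  # inner (expanded) width of the mixer

# The inner width must split evenly into heads for this convention to apply.
if d_inner % mamba["head_dim"] != 0:
    raise ValueError("d_inner is not divisible by head_dim")

ssm_heads = d_inner // mamba["head_dim"]

print(d_inner, ssm_heads)
```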
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalar_params":true}
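The `adam_for_scalar_params` flag suggests a split in which matrix-shaped weights go to Muon and scalar or vector parameters go to Adam. A rough sketch of such a partition follows. The parameter names and the `split_params` helper are hypothetical, and the card does not say how embeddings or the output head are treated, so they are left out.

```python
# Hypothetical partition rule: 2-D weights -> Muon, everything else -> Adam.
def split_params(shapes: dict[str, tuple[int, ...]]) -> tuple[list[str], list[str]]:
    muon_names, adam_names = [], []
    for name, shape in shapes.items():
        if len(shape) == 2:
            muon_names.append(name)  # matrices: orthogonalized updates
        else:
            adam_names.append(name)  # gains, biases, scalars
    return muon_names, adam_names


# Illustrative shapes only; names are not taken from the training script.
shapes = {
    "blocks.0.attn.q_proj.weight": (448, 448),
    "blocks.0.attn.k_proj.weight": (224, 448),
    "blocks.0.norm.weight": (448,),
    "blocks.0.attn.q_gain": (),
}

muon_names, adam_names = split_params(shapes)
print(muon_names)
print(adam_names)
```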
Weight Averaging
SWA
parameters: null
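Since no SWA parameters are recorded, the following is only a generic sketch of stochastic weight averaging as an equal-weight running mean over checkpoints. The checkpoint values and the `swa_update` helper are made up for illustration. The averaging start step and frequency used in the submission are unknown.

```python
# Generic SWA sketch: equal-weight running mean over saved checkpoints.
def swa_update(avg: dict, new: dict, n_averaged: int) -> dict:
    """Fold one more checkpoint into an average that already covers
    `n_averaged` checkpoints."""
    return {
        name: [a + (w - a) / (n_averaged + 1) for a, w in zip(avg[name], new[name])]
        for name in avg
    }


# Toy checkpoints with a single two-element parameter (illustrative values).
checkpoints = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 9.0]}]

swa = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)

print(swa)
```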
Quantization
GPTQ
bits: 8
scope: model
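To show what an 8-bit weight format involves, here is a sketch of symmetric per-row int8 quantization using plain round-to-nearest. This is deliberately not GPTQ: GPTQ additionally uses calibration data to compensate rounding error across columns, and that step is omitted here. The helper names are hypothetical.

```python
# Simplified stand-in for the int8 storage format: symmetric, per-row,
# round-to-nearest. GPTQ's error-compensating updates are NOT implemented.
def quantize_row_int8(row: list[float]) -> tuple[list[int], float]:
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]


row = [0.5, -1.27, 0.0, 1.0]  # illustrative weights
q, scale = quantize_row_int8(row)
restored = dequantize_row(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))

print(q, scale, max_err)
```

With one byte per weight plus one scale per row, the quantized tensors are what a general-purpose compressor such as zstd would then pack into the final artifact.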
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride_frac":0.5}
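A sliding-window evaluation with `stride_frac` 0.5 can be sketched as below. Two assumptions are made: the evaluation window equals the 1024-token training length (the card lists `eval_length` as null), and each window scores only the tokens not already scored by an earlier window. The `sliding_windows` helper is hypothetical.

```python
# Sketch of overlapping evaluation windows; each token is scored exactly once.
def sliding_windows(n_tokens: int, window: int, stride_frac: float) -> list[tuple[int, int, int]]:
    """Return (start, end, score_from) triples. The model sees tokens
    [start, end) and loss is counted only on [score_from, end)."""
    stride = max(1, int(window * stride_frac))
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end == n_tokens:
            break
        start += stride
    return spans


spans = sliding_windows(n_tokens=2560, window=1024, stride_frac=0.5)
for span in spans:
    print(span)
```

After the first window, every scored token has at least half a window of preceding context, which is the usual motivation for paying the extra compute of overlapping windows.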
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
SentencePiece BPE tokenizer with 8192 vocabulary size
parameters: {"vocab_size":8192}
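The headline metric, val_bpb, is normalized by bytes rather than tokens, so scores remain comparable across tokenizers with different vocabulary sizes. A minimal sketch of the conversion, assuming the loss is a summed negative log-likelihood in nats over the validation text; the numbers in the example are made up and are not results from this run.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) over a text
    into bits per UTF-8 byte of that text."""
    return total_nll_nats / (math.log(2) * total_bytes)


# Illustrative numbers only: 1000 tokens at 2.0 nats each over 3000 bytes.
example_bpb = bits_per_byte(total_nll_nats=1000 * 2.0, total_bytes=3000)
print(example_bpb)
```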

Novel Contributions

  • Non-record 16MB submission centered on SP8192 BPE
  • Hybrid architecture replacing every 4th transformer attention block with Mamba3 SSM
  • Use of official Mamba CUDA extension for efficient state-space layers
  • Muon + Adam optimizer combination with SWA
  • GPTQ int8 quantization with zstd compression
  • Detailed packaged run with training script, log, metadata, dependencies, and tokenizer assets