PR #1643

closed

Non-record: Mamba-3 Hybrid SSM + SP8192 + Legal TTT — 1.1473 bpb

by mradassaad
val_bpb: 1.1473
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.93 MB

Training Techniques

Architecture
Mamba
Hybrid Mamba-3 state-space model with attention layers inserted at positions 2 and 5.
parameters: {"layers":7,"attn_layers":2,"dim":512,"d_state":64,"expand":2,"headdim":64,"chunk_size":64,"mlp_mult":3}
GQA
Causal grouped-query attention with RoPE and GLU values.
parameters: {"heads":8,"kv_heads":4}
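Grouped-query attention shares each key/value head across a group of query heads; with heads=8 and kv_heads=4 from the parameters above, every pair of query heads reads the same KV head. A minimal sketch of the head mapping (hypothetical helper, not the PR's code):

```python
def kv_head_for(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query head index to the KV head its group shares (GQA)."""
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size
```

Halving kv_heads halves the KV cache without reducing the number of query heads.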
weight tying
Input and output token embeddings share a single weight matrix (tied).
parameters: null
U-Net skip connections
U-Net style skip connections in the hybrid architecture.
parameters: null
SmearGate
SmearGate component included in the model.
parameters: null
BigramHash
BigramHash feature used in the model.
parameters: null
LeakyReLU²
LeakyReLU² hidden activation in the MLP.
parameters: null
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
late QAT
bits: null
scope: block weights and embeddings
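The int8 embedding entry above amounts to plain symmetric quantization; a minimal per-row sketch (hypothetical helper, shown for the embeddings only — GPTQ on the block weights is error-compensating and considerably more involved):

```python
def quantize_int8(row):
    """Symmetric per-row int8 quantization: max |value| maps to 127."""
    scale = max(abs(v) for v in row) / 127 or 1.0  # avoid zero scale
    return [round(v / scale) for v in row], scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]
```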
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
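Muon applies heavyweight momentum and then orthogonalizes each matrix-shaped update with a quintic Newton-Schulz iteration before taking a step at matrix_lr=0.025. A numpy sketch (coefficients from the public Muon reference; this is an illustrative sketch, not the PR's implementation):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Quintic Newton-Schulz iteration pushing G's singular values toward 1,
    approximating the orthogonal factor U V^T of its SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.025, momentum=0.99):
    """One Muon update: momentum buffer, then an orthogonalized step."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orth(buf), buf
```

The iteration only normalizes singular values loosely (they land near 1, not exactly at 1), which is sufficient in practice.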
Weight Averaging
EMA
parameters: {"decay":0.997}
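EMA weight averaging with decay 0.997 keeps a slowly-moving shadow copy of the weights for evaluation. A minimal sketch over flat float parameters (hypothetical helper, not the PR's code):

```python
class EMA:
    """Exponential moving average of model parameters; decay from the PR."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # shadow copy evaluated at the end

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```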
Evaluation
stateful-overlap eval
parameters: {"overlap":1024}
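One plausible reading of stateful-overlap eval: slide a seq_len=4096 window forward by seq_len - overlap tokens, re-reading overlap=1024 tokens as context while counting only the fresh tokens toward val_bpb. The bookkeeping below is an assumption about that scheme, not the PR's code:

```python
def overlap_windows(n_tokens, seq_len=4096, overlap=1024):
    """Return (window_start, score_from) pairs: each window re-reads
    `overlap` context tokens, but only tokens from `score_from` on
    are scored. The first window scores everything."""
    step = seq_len - overlap
    start, windows = 0, []
    while start < n_tokens:
        score_from = start if start == 0 else start + overlap
        windows.append((start, score_from))
        if start + seq_len >= n_tokens:
            break
        start += step
    return windows
```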
Test-Time Training
score-first TTT
parameters: {"chunks":310,"chunk_tokens":32,"seq_len":4096,"learning_rate":0.01,"momentum":0.9,"epochs":1}
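"Score-first" TTT as listed above: score all 310 chunks under the frozen weights first, then adapt on them for one epoch of SGD with lr 0.01 and momentum 0.9. The hardest-first ordering in this sketch is an assumption — the PR does not specify its scoring criterion — and the scalar weight is a toy stand-in:

```python
def score_first_ttt(chunks, loss_fn, grad_fn, w, lr=0.01, mu=0.9, epochs=1):
    """Score every chunk under the frozen weights, then train on the
    fixed (here: hardest-first) ordering with SGD + momentum."""
    order = sorted(range(len(chunks)),
                   key=lambda i: loss_fn(w, chunks[i]), reverse=True)
    v = 0.0
    for _ in range(epochs):
        for i in order:
            v = mu * v + grad_fn(w, chunks[i])
            w = w - lr * v
    return w
```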
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":2600}
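A warmdown schedule holds the LR constant and then decays it to zero over the final warmdown_iters=2600 steps. The linear shape below is the usual convention; only the iteration count comes from the PR:

```python
def lr_scale(step, total_iters, warmdown_iters=2600):
    """LR multiplier: 1.0 until the warmdown window, then linear to 0."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```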
Regularization
weight decay
parameters: {"weight_decay":0.04}
logit softcap
parameters: null
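Logit softcapping squashes logits smoothly into (-cap, cap) via tanh. The entry above lists parameters: null, so the cap value is unknown; 15.0 below is a placeholder:

```python
import math

def softcap(logit, cap=15.0):
    """Bound a logit to (-cap, cap) while staying near-identity around 0."""
    return cap * math.tanh(logit / cap)
```

Unlike hard clipping, the gradient stays nonzero everywhere, so training signal is preserved for extreme logits.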

Novel Contributions

  • Hybrid Mamba-3 SSM plus attention architecture
  • SP8192 tokenizer trained from scratch on FineWeb
  • INT8 embedding quantization with GPTQ on weights
  • Chunk score-first test-time training
  • Stateful-overlap evaluation for faster inference
  • QAT applied to Mamba-3 linear layers to reduce quantization gap
  • Pure Triton Mamba-3 kernel integration and profiling