PR #1644
Non-record: Mamba-3 Hybrid SSM + SP8192 + Legal TTT — 1.1473 bpb
by mradassaad
val_bpb
1.1473
Architecture
Mamba
Optimizer
Muon
Artifact Size
15.93MB
Training Techniques
Architecture
Hybrid
7-layer Mamba-3 hybrid SSM with 5 SSM blocks and 2 FlashAttention layers placed at positions 2 and 5.
parameters: {"layers":7,"ssm_blocks":5,"attention_layers":2,"dim":512,"d_state":64,"expand":2,"headdim":64,"chunk_size":64,"mlp_mult":3}
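The layer layout implied by these parameters can be sketched as follows. This is an illustrative plan only (layer names and 0-indexed attention positions are assumptions, not the PR's actual module names):

```python
# Hypothetical sketch of the 7-layer hybrid stack: 5 SSM blocks with
# FlashAttention layers interleaved at (assumed 0-indexed) positions 2 and 5.
def build_layer_plan(n_layers=7, attn_positions=(2, 5)):
    """Return a list of layer-type tags for the hybrid stack."""
    return ["attn" if i in attn_positions else "ssm" for i in range(n_layers)]

plan = build_layer_plan()
# -> ["ssm", "ssm", "attn", "ssm", "ssm", "attn", "ssm"]
```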
Quantization
GPTQ
bits: 6
scope: all
int8
bits: 8
scope: embeddings
late QAT
bits: null
scope: all
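One plausible reading of the "int8, scope: embeddings" entry is symmetric per-row int8 quantization of the embedding table; a minimal sketch (the PR's actual quantizer may differ, and GPTQ handles the remaining weights):

```python
# Hedged sketch: symmetric per-row int8 quantization of an embedding row.
def quantize_int8(row):
    """Quantize one row to int8 with a single symmetric scale."""
    scale = max(abs(x) for x in row) / 127 or 1.0
    q = [round(x / scale) for x in row]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.27, 0.0, 1.27]
q, s = quantize_int8(row)
approx = dequantize(q, s)
```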
Test-Time Training
score-first TTT
parameters: {"chunks":310,"chunk_tokens":32768,"learning_rate":0.01,"momentum":0.9,"epochs":1}
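A minimal sketch of the "score-first" idea, assuming it means chunks are scored and ordered before a one-epoch SGD+momentum adaptation pass. The scoring function and the scalar update below are placeholders, not the PR's implementation; only the lr/momentum values come from the parameters above:

```python
# Illustrative score-first TTT: rank chunks by score, then adapt with
# SGD + momentum (lr=0.01, momentum=0.9 per the PR's TTT parameters).
def ttt_order(chunk_scores):
    """Return chunk indices, highest score first."""
    return sorted(range(len(chunk_scores)), key=lambda i: -chunk_scores[i])

def sgd_momentum_step(w, grad, vel, lr=0.01, momentum=0.9):
    """One SGD+momentum update on a scalar stand-in for a parameter."""
    vel = momentum * vel + grad
    return w - lr * vel, vel

order = ttt_order([0.2, 0.9, 0.5])   # -> [1, 2, 0]
w, v = sgd_momentum_step(w=1.0, grad=0.5, vel=0.0)
```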
Evaluation
stateful-overlap eval
parameters: {"overlap":1024}
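A sketch of how overlapped evaluation windows could be laid out, assuming each window re-reads 1024 tokens of context (with SSM state carried over) and only the fresh tokens are scored. Window size and stride logic here are assumptions beyond the stated overlap:

```python
# Hypothetical stateful-overlap window plan: windows of `window` tokens,
# stride = window - overlap; only tokens past `score_from` are scored.
def overlap_windows(n_tokens, window=4096, overlap=1024):
    """Return (start, end, score_from) spans covering n_tokens."""
    stride = window - overlap
    spans, start = [], 0
    while start < n_tokens:
        score_from = 0 if start == 0 else overlap
        spans.append((start, min(start + window, n_tokens), score_from))
        start += stride
    return spans

spans = overlap_windows(10240)
```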
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":2600,"shape":"linear"}
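The linear warmdown above corresponds to a schedule like the following sketch: constant LR, then a linear ramp to zero over the final 2600 steps (the base LR and total step count here are placeholders):

```python
# Linear warmdown: hold base_lr, then decay linearly to 0 over the
# last `warmdown_steps` steps of training.
def warmdown_lr(step, total_steps, warmdown_steps=2600, base_lr=1.0):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```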
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_eq_r":1,"matrix_lr":0.025}
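Muon's core step orthogonalizes each weight matrix's momentum-smoothed gradient with a Newton-Schulz iteration before applying it at matrix_lr. A pure-Python sketch using the coefficients from the public Muon reference; the PR's exact variant (e.g. the muon_eq_r setting) is not reproduced here:

```python
# Hedged sketch of Muon's Newton-Schulz orthogonalization of a gradient
# matrix G (quintic iteration, coefficients from the public reference).
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Drive G's singular values toward 1 (approximate orthogonalization)."""
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, [list(r) for r in zip(*X)])  # X @ X.T
        A2 = matmul(A, A)
        BX, CX = matmul(A, X), matmul(A2, X)
        X = [[a * x + b * y + c * z
              for x, y, z in zip(rx, ry, rz)]
             for rx, ry, rz in zip(X, BX, CX)]
    return X

X = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```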
Weight Averaging
EMA
parameters: {"decay":0.997}
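The EMA with decay 0.997 amounts to the standard elementwise update, sketched here on flat parameter lists for illustration:

```python
# EMA of weights: avg <- decay * avg + (1 - decay) * params (decay=0.997).
def ema_update(avg, params, decay=0.997):
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]

avg = ema_update([1.0, 0.0], [0.0, 1.0])
```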
Regularization
weight decay
parameters: {"value":0.04}
Compression
lzma
level: null
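Since the level is unspecified, a sketch with Python's stdlib lzma at its default preset (the PR may use a different tool or setting):

```python
# Hedged sketch: lzma-compress serialized model bytes for the artifact.
import lzma

def compress_artifact(raw: bytes) -> bytes:
    return lzma.compress(raw)  # default preset; PR's level is unspecified

blob = compress_artifact(b"weights" * 1000)
restored = lzma.decompress(blob)
```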
Other
other
SP8192 SentencePiece BPE tokenizer trained from scratch on FineWeb/docs_selected.jsonl.
parameters: {"vocab_size":8192}
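A hypothetical training invocation for the SP8192 tokenizer, using standard SentencePiece CLI flags; only vocab_size=8192, BPE, and the input path come from this card, and other settings (normalization, character coverage) are not specified:

```python
# Illustrative spm_train command for the SP8192 tokenizer (flags are
# standard SentencePiece options; exact PR settings are unknown).
spm_train_cmd = (
    "spm_train"
    " --input=FineWeb/docs_selected.jsonl"
    " --model_prefix=sp8192"
    " --model_type=bpe"
    " --vocab_size=8192"
)
```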
Novel Contributions
- Hybrid Mamba-3 SSM with interleaved FlashAttention layers
- SP8192 tokenizer trained from scratch
- INT8 embeddings plus GPTQ quantization for a compact artifact
- Chunk score-first test-time training
- Stateful-overlap evaluation with SSM state carryover
- Mamba-3 Triton kernel profiling and autotuning analysis