PR #1643

closed

Non-record: Mamba-3 Hybrid SSM + SP8192 + Legal TTT — 1.1473 bpb

by mradassaad
val_bpb: 1.1473
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.93 MB

Training Techniques

Architecture
Mamba
Hybrid Mamba-3 state-space model with attention layers inserted at positions 2 and 5.
parameters: {"layers":7,"attn_layers":2,"dim":512,"d_state":64,"expand":2,"headdim":64,"chunk_size":64,"mlp_mult":3}
GQA
Causal grouped-query attention with RoPE and GLU values.
parameters: {"heads":8,"kv_heads":4}
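Grouped-query attention shares each key/value head across a group of query heads; with heads=8 and kv_heads=4 from the parameters above, every pair of query heads reads the same KV head. A minimal sketch of the head mapping (hypothetical helper, not the PR's code):

```python
def kv_head_for(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query head index to the KV head its group shares (GQA)."""
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size
```

Halving kv_heads halves the KV cache without reducing the number of query heads.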
weight tying
Input and output token embeddings share a single weight matrix (tied).
parameters: null
U-Net skip connections
U-Net style skip connections in the hybrid architecture.
parameters: null
SmearGate
SmearGate component included in the model.
parameters: null
BigramHash
BigramHash feature used in the model.
parameters: null
LeakyReLU²
LeakyReLU² hidden activation in the MLP.
parameters: null
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
late QAT
bits: null
scope: block weights and embeddings
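The int8 embedding entry above amounts to plain symmetric quantization; a minimal per-row sketch (hypothetical helper, shown for the embeddings only — GPTQ on the block weights is error-compensating and considerably more involved):

```python
def quantize_int8(row):
    """Symmetric per-row int8 quantization: max |value| maps to 127."""
    scale = max(abs(v) for v in row) / 127 or 1.0  # avoid zero scale
    return [round(v / scale) for v in row], scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in q]
```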
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
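Muon applies heavyweight momentum and then orthogonalizes each matrix-shaped update with a quintic Newton-Schulz iteration before taking a step at matrix_lr=0.025. A numpy sketch (coefficients from the public Muon reference; this is an illustrative sketch, not the PR's implementation):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Quintic Newton-Schulz iteration pushing G's singular values toward 1,
    approximating the orthogonal factor U V^T of its SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.025, momentum=0.99):
    """One Muon update: momentum buffer, then an orthogonalized step."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orth(buf), buf
```

The iteration only normalizes singular values loosely (they land near 1, not exactly at 1), which is sufficient in practice.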
Weight Averaging
EMA
parameters: {"decay":0.997}
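EMA weight averaging with decay 0.997 keeps a slowly-moving shadow copy of the weights for evaluation. A minimal sketch over flat float parameters (hypothetical helper, not the PR's code):

```python
class EMA:
    """Exponential moving average of model parameters; decay from the PR."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # shadow copy evaluated at the end

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]
```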
Evaluation
stateful-overlap eval
parameters: {"overlap":1024}
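One plausible reading of stateful-overlap eval: slide a seq_len=4096 window forward by seq_len - overlap tokens, re-reading overlap=1024 tokens as context while counting only the fresh tokens toward val_bpb. The bookkeeping below is an assumption about that scheme, not the PR's code:

```python
def overlap_windows(n_tokens, seq_len=4096, overlap=1024):
    """Return (window_start, score_from) pairs: each window re-reads
    `overlap` context tokens, but only tokens from `score_from` on
    are scored. The first window scores everything."""
    step = seq_len - overlap
    start, windows = 0, []
    while start < n_tokens:
        score_from = start if start == 0 else start + overlap
        windows.append((start, score_from))
        if start + seq_len >= n_tokens:
            break
        start += step
    return windows
```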
Test-Time Training
score-first TTT
parameters: {"chunks":310,"chunk_tokens":32,"seq_len":4096,"learning_rate":0.01,"momentum":0.9,"epochs":1}
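"Score-first" TTT as listed above: score all 310 chunks under the frozen weights first, then adapt on them for one epoch of SGD with lr 0.01 and momentum 0.9. The hardest-first ordering in this sketch is an assumption — the PR does not specify its scoring criterion — and the scalar weight is a toy stand-in:

```python
def score_first_ttt(chunks, loss_fn, grad_fn, w, lr=0.01, mu=0.9, epochs=1):
    """Score every chunk under the frozen weights, then train on the
    fixed (here: hardest-first) ordering with SGD + momentum."""
    order = sorted(range(len(chunks)),
                   key=lambda i: loss_fn(w, chunks[i]), reverse=True)
    v = 0.0
    for _ in range(epochs):
        for i in order:
            v = mu * v + grad_fn(w, chunks[i])
            w = w - lr * v
    return w
```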
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":2600}
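A warmdown schedule holds the LR constant and then decays it to zero over the final warmdown_iters=2600 steps. The linear shape below is the usual convention; only the iteration count comes from the PR:

```python
def lr_scale(step, total_iters, warmdown_iters=2600):
    """LR multiplier: 1.0 until the warmdown window, then linear to 0."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```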
Regularization
weight decay
parameters: {"weight_decay":0.04}
logit softcap
parameters: null
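Logit softcapping squashes logits smoothly into (-cap, cap) via tanh. The entry above lists parameters: null, so the cap value is unknown; 15.0 below is a placeholder:

```python
import math

def softcap(logit, cap=15.0):
    """Bound a logit to (-cap, cap) while staying near-identity around 0."""
    return cap * math.tanh(logit / cap)
```

Unlike hard clipping, the gradient stays nonzero everywhere, so training signal is preserved for extreme logits.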

Novel Contributions

  • Hybrid Mamba-3 SSM plus attention architecture
  • SP8192 tokenizer trained from scratch on FineWeb
  • INT8 embedding quantization with GPTQ on weights
  • Chunk score-first test-time training
  • Stateful-overlap evaluation for faster inference
  • QAT applied to Mamba-3 linear layers to reduce quantization gap
  • Pure Triton Mamba-3 kernel integration and profiling