PR #1107

open

Mamba-3 SSD + Attention Hybrid with QAT (1.5633 bpb)

by mradassaad

val_bpb

1.5633

Architecture

Hybrid

Optimizer

Muon

Artifact Size

10.9MB

Training Techniques

Architecture

Mamba

Hybrid model using 7 Mamba-3 SISO SSD blocks and 1 attention layer in an 8-layer stack

parameters: {"layers":8,"mamba_layers":7,"attention_layers":1,"dim":512,"d_state":64,"mlp_mult":3,"seq_len":4096}
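The parameters above can be read as a small configuration object. The sketch below is illustrative only: the class name `HybridConfig`, the `layer_types` helper, and the position of the attention layer are hypothetical, since the summary does not say where the single attention layer sits in the stack. It also assumes `mlp_mult` multiplies `dim` to give the MLP hidden width, which is the usual convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:
    """Hypothetical container for the listed architecture parameters."""
    layers: int = 8
    mamba_layers: int = 7
    attention_layers: int = 1
    dim: int = 512
    d_state: int = 64
    mlp_mult: int = 3
    seq_len: int = 4096

    def __post_init__(self):
        # The two block counts must account for every layer in the stack.
        if self.mamba_layers + self.attention_layers != self.layers:
            raise ValueError("mamba_layers + attention_layers must equal layers")

    @property
    def mlp_hidden(self) -> int:
        # Assumption: hidden width = dim * mlp_mult.
        return self.dim * self.mlp_mult

    def layer_types(self, attn_positions):
        """Return the block type per layer; attn_positions is an assumed placement."""
        if len(set(attn_positions)) != self.attention_layers:
            raise ValueError("attn_positions must name each attention layer once")
        if any(not 0 <= p < self.layers for p in attn_positions):
            raise ValueError("attention position out of range")
        return ["attention" if i in attn_positions else "mamba"
                for i in range(self.layers)]


cfg = HybridConfig()
print(cfg.mlp_hidden)
print(cfg.layer_types((4,)))  # index 4 is an arbitrary example, not the PR's choice
```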

GQA

Causal grouped-query attention with 8 heads and 4 KV heads

parameters: {"heads":8,"kv_heads":4}
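With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A minimal sketch of that bookkeeping follows. The function name is hypothetical, the contiguous grouping (heads 0 and 1 share KV head 0, and so on) is the common convention rather than something the summary states, and `dim = 512` is taken from the architecture parameters listed earlier.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it reads from (assumed contiguous groups)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return q_head // group_size


heads, kv_heads, dim = 8, 4, 512
head_dim = dim // heads                 # per-head width
q_proj_width = heads * head_dim         # query projection output width
kv_proj_width = kv_heads * head_dim     # key (or value) projection output width

mapping = [kv_head_for_query_head(h, heads, kv_heads) for h in range(heads)]
print(mapping, head_dim, q_proj_width, kv_proj_width)
```

Under these assumptions the key and value projections are half the width of the query projection, which is where grouped-query attention saves parameters.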

RoPE

Rotary positional embeddings used in attention
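As a reminder of what rotary embeddings do, the sketch below rotates consecutive pairs of a query or key vector by a position-dependent angle. It is a plain-Python illustration, not the PR's code: the base of 10000 and the interleaved pairing are assumptions, and the function names are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate pairs (vec[i], vec[i+1]) by an angle that grows with position."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair k = i // 2 uses frequency base ** (-2k / d) == base ** (-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (here 2 in both cases).
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```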

parameters: null

LeakyReLU

Squared LeakyReLU activation on the MLP hidden layer
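A scalar sketch of a squared LeakyReLU is shown below. The negative slope is an assumption (the summary lists no parameters for this entry), and the function name is hypothetical.

```python
def leaky_relu_squared(x, negative_slope=0.01):
    """Apply LeakyReLU, then square the result (slope value is an assumption)."""
    y = x if x >= 0 else negative_slope * x
    return y * y


print(leaky_relu_squared(3.0))
print(leaky_relu_squared(-2.0, negative_slope=0.5))
```

Note that squaring makes the output non-negative, so negative inputs contribute a small positive value rather than a small negative one.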

parameters: null

U-Net skip connections

U-Net-style skip connections included in the hybrid architecture

parameters: null

SmearGate

SmearGate component used in the model

parameters: null

BigramHash

BigramHash component used in the model

parameters: null

weight tying

Tied embeddings used

parameters: null

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"matrix_lr":0.025}

Quantization

QAT

bits: 6

scope: Mamba projections and standard CastedLinear layers
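To illustrate what 6-bit QAT simulates in the forward pass, the sketch below fake-quantizes one weight row: it rounds each weight to a 6-bit integer code and maps it back to a float. This is a generic symmetric per-row scheme chosen for illustration; the summary does not specify the scaling granularity or rounding used by CastedLinear, and the function name is hypothetical. In actual training, gradients would typically pass through the rounding step via a straight-through estimator, which plain Python cannot show.

```python
def fake_quantize_row(row, bits=6):
    """Round a weight row to signed `bits`-bit codes and dequantize (symmetric, per-row)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max((abs(w) for w in row), default=0.0)
    # Guard against an all-zero row, where any scale works.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    dequantized = [c * scale for c in codes]
    return dequantized, codes, scale


weights = [0.31, -0.1, 0.005, -0.31]
deq, codes, scale = fake_quantize_row(weights)
print(codes)
print(deq)
```

Each dequantized weight differs from the original by at most half a quantization step, which is the error the model learns to tolerate during QAT.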

Compression

zlib

level: 9
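Compressing a serialized artifact with zlib at level 9 can be sketched with the Python standard library. The helper names and the synthetic payload are hypothetical; the real artifact layout is not described in the summary.

```python
import zlib


def pack_artifact(raw: bytes, level: int = 9) -> bytes:
    """Compress raw bytes with zlib at the given level (9 = slowest, smallest)."""
    return zlib.compress(raw, level)


def unpack_artifact(blob: bytes) -> bytes:
    """Invert pack_artifact."""
    return zlib.decompress(blob)


# Synthetic stand-in for quantized weights: values confined to a 6-bit range.
raw = bytes((i * 7) % 64 for i in range(4096))
blob = pack_artifact(raw)
print(len(raw), len(blob))
```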

Test-Time Training

full TTT

parameters: {"epochs":1}

LR Schedule

warmdown

parameters: {"warmdown_iters":22000}
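A warmdown schedule holds the learning rate constant and then decays it over the final `warmdown_iters` steps. The sketch below assumes a linear decay to zero and a hypothetical `total_steps` argument; neither the decay shape nor the total step count is given in the summary.

```python
def warmdown_scale(step, total_steps, warmdown_iters=22000):
    """Learning-rate multiplier: 1.0 until warmdown starts, then linear decay to 0."""
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)


# total_steps = 100000 is an arbitrary example value.
for step in (0, 78000, 89000, 100000):
    print(step, warmdown_scale(step, total_steps=100000))
```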

Sequence Length

sequence_length

train_length: 4096

eval_length: null

Regularization

weight decay

parameters: {"value":0.04}

Novel Contributions

Mamba-3 SISO SSD + attention hybrid architecture for parameter golf
QAT applied to Mamba-3 projections by replacing nn.Linear projections with CastedLinear
Hardware-dependent warmdown schedule fix based on step time
Demonstration that fewer attention layers and smaller MLPs win under a fixed wall-clock budget
Ablation finding that GLU values in attention provided the largest gain
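The hardware-dependent warmdown fix listed above is not described in detail. One plausible reading is that, under a fixed wall-clock budget, a warmdown expressed in iterations covers a different fraction of the run depending on how fast each step is, so the schedule can be driven by measured step time instead. The sketch below shows that idea only; the function name, the 600-second budget, and the step times are all hypothetical and should not be taken as the PR's implementation.

```python
def time_based_warmdown_scale(elapsed_s, budget_s, step_time_s, warmdown_iters=22000):
    """Learning-rate multiplier driven by remaining wall-clock time (illustrative)."""
    # Convert the iteration-based warmdown length into seconds on this hardware.
    warmdown_s = warmdown_iters * step_time_s
    if warmdown_s <= 0:
        return 1.0
    remaining_s = max(0.0, budget_s - elapsed_s)
    # Full rate while more than one warmdown window remains, then linear decay.
    return min(1.0, remaining_s / warmdown_s)


# Fast hardware: warmdown window is shorter than the budget.
print(time_based_warmdown_scale(elapsed_s=0.0, budget_s=600.0, step_time_s=0.01))
# Slow hardware: the window exceeds the budget, so decay begins immediately.
print(time_based_warmdown_scale(elapsed_s=0.0, budget_s=600.0, step_time_s=0.05))
```

Under this reading, slower hardware would spend the entire run in warmdown unless `warmdown_iters` is adjusted, which may be the kind of mismatch the fix addresses.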