val_bpb: 1.3587
Architecture: Mamba
Optimizer: Muon
Artifact Size: 15.93 MB
Training Techniques
Architecture
Mamba
A purely recurrent Mamba state-space model baseline with no attention layers.
parameters: {"d_model":640,"d_inner":1280,"d_state":34,"d_conv":4,"num_layers":8,"head_adapter_rank":16,"vocab_size":1056}
weight tying
The output logit matrix is tied to the input embedding matrix, with a low-rank head adapter added on top.
parameters: null
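The tied head with a rank-16 adapter can be sketched as follows; the matrix names are illustrative, as the submission reports only "tied embedding logits with a low-rank head adapter".

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, rank = 640, 1056, 16

E = rng.standard_normal((vocab, d_model)) * 0.02  # shared embedding / logit matrix
A = rng.standard_normal((rank, d_model)) * 0.02   # adapter down-projection
B = np.zeros((vocab, rank))                       # adapter up-projection, zero-init

h = rng.standard_normal(d_model)                  # final hidden state
logits = E @ h + B @ (A @ h)                      # tied logits + low-rank correction
```

With B zero-initialized the adapter starts as a no-op, so training begins from the plain tied-embedding head and the adapter learns a cheap rank-16 correction (about 2 * 16 * 640 extra parameters on top of the shared matrix).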
other
A "fat state" design: d_state is raised to 34 to increase recurrent memory capacity.
parameters: {"d_state":34}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"2D weight matrices"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalar parameters","fused":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ-lite
bits: 6
scope: all weights; embeddings int8
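The exact "GPTQ-lite" procedure is not reported; full GPTQ additionally compensates rounding error with second-order (Hessian-based) updates. The sketch below shows only the core per-channel symmetric round-to-nearest step at 6 bits, as a simplified stand-in:

```python
import numpy as np

def quantize_int6(W):
    # Per-output-channel symmetric quantization to signed int6 ([-32, 31]).
    qmax = 2 ** (6 - 1) - 1                               # 31
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)              # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((8, 16)).astype(np.float32)
q, s = quantize_int6(W)
err = np.abs(dequantize(q, s) - W).max()                  # bounded by scale / 2
```

Embeddings are kept at int8 per the scope above, presumably because lookup tables are more sensitive to rounding than matmul weights.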
QAT
bits: 6
scope: weights
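Pairing 6-bit QAT with the post-training quantizer means the forward pass sees quantized weights during training while gradients flow through unchanged (the straight-through estimator). The submission does not detail its QAT recipe; this is a minimal framework-free sketch of that fake-quantization step:

```python
import numpy as np

def fake_quant(w, bits=6):
    # Forward: round to the 6-bit grid, return values in float.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        scale = 1.0
    return np.round(w / scale) * scale

def fake_quant_grad(upstream_grad):
    # Straight-through estimator: round() is treated as identity in backward,
    # so the weight gradient is just the upstream gradient.
    return upstream_grad
```

Training against the quantized forward pass lets the weights settle into values that survive the final int6 export with less loss than quantizing a purely full-precision model.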
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8}
Other
other
Score-first test-time training: LoRA adapters are updated on the previous window's tokens before the current window is scored, so no window is ever scored by a model that has already trained on it.
parameters: null
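The score-then-adapt ordering can be illustrated with a fully self-contained toy, where the "LoRA adapter" is a single scalar bias trained by SGD. The names and shapes are illustrative, not the submission's; the point is only the loop order: each window is scored first, and only then used for adaptation.

```python
import numpy as np

def score_first_ttt(windows, lr=0.1):
    b = 0.0                                    # stand-in for the adapter state
    losses = []
    for w in windows:
        losses.append(np.mean((w - b) ** 2))   # 1. score the window first
        b -= lr * np.mean(2 * (b - w))         # 2. then adapt on its tokens
    return losses, b

windows = [np.full(8, 3.0), np.full(8, 3.0), np.full(8, 3.0)]
losses, b = score_first_ttt(windows)
# losses decrease across windows as the adapter tracks the data,
# yet every score is computed before the model has seen that window
```

In the actual submission the adapted parameters are the rank-8 LoRA matrices listed above rather than a scalar, but the bookkeeping is the same.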
other
Online entropy-based data filtering using a zlib compression ratio heuristic at batch load time.
parameters: {"thresholds":[4,2.5,2,1.8]}
LR Schedule
warmup + stable + cosine decay
parameters: {"warmup":0.1,"stable":0.7,"decay":0.2}
Regularization
gradient checkpointing
parameters: null
Novel Contributions
- Pure Mamba SSM baseline within the 16 MB / 10-minute constraint
- Fat State design with d_state=34 for richer recurrent memory
- Score-First LoRA TTT adapted for recurrent architectures
- Entropy-based data filtering heuristic at batch load time
- GPTQ-lite int6 compression pipeline with zstd artifact compression