PR #220

open

[WIP] SSM LRU Baseline — First State Space Model Submission

by timothywangdev on GitHub
val_bpb
1.8480
Architecture
Linear Recurrent Unit (LRU) state space model
Optimizer
MuonAdamW
Artifact Size
16MB

Training Techniques

Architecture
LRU / state space model
Replaces transformer attention with a Linear Recurrent Unit state space model using complex diagonal recurrence.
parameters: null
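A minimal scalar sketch of what the LRU recurrence computes per channel, assuming the standard diagonal form with a complex eigenvalue of magnitude below one (the function and variable names here are illustrative, not taken from the submission):

```python
import cmath

def lru_recurrence(lam, b_seq):
    """Sequential reference for one channel of a diagonal complex recurrence:
    x_t = lam * x_{t-1} + b_t, with complex eigenvalue |lam| < 1."""
    x = 0j
    states = []
    for b in b_seq:
        x = lam * x + b
        states.append(x)
    return states

# toy example: a single decaying, rotating mode (values are illustrative)
lam = 0.9 * cmath.exp(0.5j)
states = lru_recurrence(lam, [1 + 0j, 0.5 + 0j])
```

In the full model each channel has its own eigenvalue, the inputs b_t come from a learned projection, and the output mixes the real part of the states.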
parallel scan
Uses a cumulative-sum trick in log space to compute the recurrence in parallel, intended to be torch.compile-friendly.
parameters: null
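The identity behind a log-space cumulative-sum scan, sketched for a single channel (the real kernel would replace the running sum with a single torch.cumsum over the sequence; this scalar version only demonstrates the algebra):

```python
import cmath

def lru_log_space_scan(lam, b_seq):
    """Closed-form evaluation of x_t = lam * x_{t-1} + b_t via
    x_t = lam**t * sum_{s<=t} b_s / lam**s, computed with logs:
    the inner cumulative sum is the part that parallelizes."""
    log_lam = cmath.log(lam)
    csum = 0j
    out = []
    for t, b in enumerate(b_seq, start=1):
        csum += b * cmath.exp(-t * log_lam)   # cumulative sum of b_s / lam**s
        out.append(cmath.exp(t * log_lam) * csum)
    return out
```

Working in log space keeps the powers of lam numerically manageable when |lam| is close to one and the sequence is long.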
gated projection
Applies a sigmoid gate to the SSM output.
parameters: null
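The gate itself is a one-liner; a minimal sketch, assuming the gate pre-activation comes from a learned linear projection of the block input (names are illustrative):

```python
import math

def gated(ssm_out, gate_preact):
    """Elementwise sigmoid gate on the SSM output: y_i = sigmoid(g_i) * h_i.
    In the model, gate_preact would be a learned projection of the input."""
    return [h / (1.0 + math.exp(-g)) for h, g in zip(ssm_out, gate_preact)]
```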
ReLU^2 MLP
Uses a ReLU-squared MLP similar to the transformer baseline.
parameters: null
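A plain-Python sketch of a ReLU-squared MLP, the same activation the transformer baseline reportedly uses (weights are nested lists of rows; shapes and names are illustrative):

```python
def relu2_mlp(x, w_in, w_out):
    """Minimal ReLU^2 MLP: h = relu(W_in x)**2, y = W_out h.
    Squaring the ReLU keeps the activation smooth away from zero."""
    h = [max(sum(w * xi for w, xi in zip(row, x)), 0.0) ** 2 for row in w_in]
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
```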
Optimizer
MuonAdamW
weight_decay: null
momentum: null
other_params: {"param_groups":"SSM-aware parameter groups; Adam for A/B/C/D and Muon for projections"}
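The grouping rule can be sketched as a simple routing by parameter name; the suffix convention below is an assumption for illustration, not the submission's actual naming:

```python
def split_param_groups(param_names, ssm_suffixes=(".A", ".B", ".C", ".D")):
    """SSM-aware grouping sketch: state parameters A/B/C/D go to the Adam
    group, projection weights to the Muon group. In practice each group
    would also carry its own learning rate and weight decay."""
    groups = {"adam": [], "muon": []}
    for name in param_names:
        key = "adam" if name.endswith(ssm_suffixes) else "muon"
        groups[key].append(name)
    return groups
```

Keeping the low-dimensional state parameters out of Muon's orthogonalized updates and under Adam is the usual motivation for this split.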
Evaluation
sliding window eval
parameters: {"no_recomputation":true}
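Why "no_recomputation" is natural for an SSM: the hidden state summarizes all past tokens, so evaluation can proceed chunk by chunk with the state carried across boundaries, with nothing re-read or recomputed. A scalar sketch under that assumption:

```python
def chunked_eval(lam, b_seq, chunk_len):
    """Evaluate x_t = lam * x_{t-1} + b_t in fixed-size chunks, carrying
    the hidden state across chunk boundaries. Unlike sliding-window
    attention, no earlier tokens are recomputed and no KV cache is kept."""
    x = 0j
    out = []
    for start in range(0, len(b_seq), chunk_len):
        for b in b_seq[start:start + chunk_len]:
            x = lam * x + b
            out.append(x)
    return out
```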
Other
other
Uses SSM-specific parameter grouping where A/B/C/D are optimized separately from projection layers.
parameters: null

Novel Contributions

  • First non-transformer submission to parameter golf using an LRU state space model
  • Complex diagonal recurrence with parallel scan in log-space
  • SSM blocks are claimed to be smaller than attention blocks at equivalent model dimension
  • SSMs can absorb the MLP, reducing block size
  • No KV cache; sliding-window evaluation is native to the recurrent state
  • MuonAdamW with SSM-aware parameter groups