PR #220
[WIP] SSM LRU Baseline — First State Space Model Submission (open)
by timothywangdev on GitHub
val_bpb
1.8480
Architecture
Linear Recurrent Unit (LRU) state space model
Optimizer
MuonAdamW
Artifact Size
16MB
Training Techniques
Architecture
LRU / state space model
Replaces transformer attention with a Linear Recurrent Unit state space model using complex diagonal recurrence.
parameters: null
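The complex diagonal recurrence can be sketched as a sequential reference. This is a minimal NumPy illustration of the general form x_t = a ⊙ x_{t-1} + (Bu)_t with a complex diagonal transition; the actual parameterization in the PR (e.g. how a is constrained inside the unit disk) is not specified here, so treat names and shapes as assumptions.

```python
import numpy as np

def lru_scan_reference(a_diag, b_u):
    """Sequential reference for the diagonal LRU recurrence
    x_t = a_diag * x_{t-1} + b_u[t], with a complex diagonal a_diag.
    b_u is the already-projected input, shape (T, N)."""
    T, N = b_u.shape
    x = np.zeros(N, dtype=complex)
    out = np.empty((T, N), dtype=complex)
    for t in range(T):
        x = a_diag * x + b_u[t]  # elementwise: the diagonal makes this O(N) per step
        out[t] = x
    return out
```

Because the transition is diagonal, each step is elementwise rather than a dense matrix multiply, which is what makes the parallel-scan reformulation below tractable.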
parallel scan
Uses a cumulative-sum trick in log-space for parallel recurrence computation, intended to be torch.compile friendly.
parameters: null
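The cumulative-sum trick can be sketched as follows: since x_t = Σ_{s≤t} a^{t-s} u_s = a^t · Σ_{s≤t} a^{-s} u_s, the whole scan reduces to one `cumsum` after rescaling in log-space. This naive form is a sketch only, it can overflow for long sequences when |a| < 1 (real implementations normalize or segment the scan), and the PR's exact formulation is an assumption.

```python
import numpy as np

def lru_scan_cumsum(a_diag, b_u):
    """Parallel-scan form of x_t = a*x_{t-1} + u_t via a log-space cumsum:
    with L_t = t * log(a),  x_t = exp(L_t) * cumsum_s(exp(-L_s) * u_s).
    exp(L_t - L_s) = a^(t-s) recovers the decayed contributions."""
    T, N = b_u.shape
    log_a = np.log(a_diag.astype(complex))            # elementwise complex log
    L = np.arange(1, T + 1)[:, None] * log_a[None, :] # L_t = t * log a, shape (T, N)
    # One cumulative sum replaces the sequential loop; friendly to compilers
    # that fuse elementwise ops around a single scan primitive.
    return np.exp(L) * np.cumsum(np.exp(-L) * b_u, axis=0)
```

The sequential loop and the cumsum form agree exactly in exact arithmetic; numerically they agree while t · |log a| stays moderate.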
gated projection
Applies a sigmoid gate to the SSM output.
parameters: null
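A minimal sketch of the gate, assuming it is computed from the block input by a learned projection (the PR only states that a sigmoid gate is applied to the SSM output, so the gate's source and shape are assumptions):

```python
import numpy as np

def gated_output(ssm_out, u, W_gate):
    """Modulate the SSM output with a sigmoid gate derived from the
    block input u. W_gate is a hypothetical learned projection."""
    gate = 1.0 / (1.0 + np.exp(-(u @ W_gate)))  # sigmoid, in (0, 1)
    return gate * ssm_out                        # elementwise gating
```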
ReLU^2 MLP
Uses a ReLU-squared MLP similar to the transformer baseline.
parameters: null
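The ReLU-squared MLP is standard: square the ReLU activation between the two projections. A bias-free sketch (bias handling in the baseline is an assumption):

```python
import numpy as np

def relu2_mlp(x, W1, W2):
    """ReLU-squared MLP: h = relu(x @ W1)**2, y = h @ W2.
    Squaring keeps the activation nonnegative and smooth at zero."""
    h = np.maximum(x @ W1, 0.0) ** 2
    return h @ W2
```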
Optimizer
MuonAdamW
weight_decay: null
momentum: null
other_params: {"param_groups":"SSM-aware parameter groups; Adam for A/B/C/D and Muon for projections"}
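The grouping itself is just a routing rule over parameter names: state-space tensors (A/B/C/D) go to Adam, dense projections to Muon. A pure-Python sketch; the name-matching convention is an assumption about the module's parameter names, not the PR's actual code.

```python
def split_param_groups(named_params):
    """Route parameters to the two inner optimizers of MuonAdamW.
    named_params is an iterable of (name, param) pairs, as from
    torch's Module.named_parameters()."""
    adam, muon = [], []
    for name, p in named_params:
        leaf = name.rsplit(".", 1)[-1]
        if leaf in {"A", "B", "C", "D"}:  # SSM state tensors -> Adam
            adam.append(p)
        else:                              # projection weights -> Muon
            muon.append(p)
    return {"adam": adam, "muon": muon}
```

Keeping A/B/C/D out of Muon makes sense insofar as Muon's orthogonalized updates are designed for dense 2-D weight matrices, while the SSM state tensors have different geometry.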
Evaluation
sliding window eval
parameters: {"no_recomputation":true}
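The no-recomputation property follows from the recurrence itself: the fixed-size state summarizes all history, so evaluation can proceed in chunks with only the state carried across boundaries. A sketch under the diagonal-recurrence assumption above (not the PR's actual eval code):

```python
import numpy as np

def chunked_eval(a_diag, u, chunk=4):
    """Evaluate a long sequence in fixed-size chunks, threading the
    recurrent state across chunk boundaries. No window is recomputed
    and no KV cache is kept; the state vector is the only carry-over."""
    state = np.zeros(a_diag.shape, dtype=complex)
    outs = np.empty(u.shape, dtype=complex)
    for start in range(0, len(u), chunk):
        for t in range(start, min(start + chunk, len(u))):
            state = a_diag * state + u[t]
            outs[t] = state
    return outs
```

Chunked and single-pass evaluation produce identical outputs, which is exactly what a transformer's sliding-window eval has to approximate with recomputation or a KV cache.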
Other
other
Uses SSM-specific parameter grouping where A/B/C/D are optimized separately from projection layers.
parameters: null
Novel Contributions
- First non-transformer submission to parameter golf using an LRU state space model
- Complex diagonal recurrence with parallel scan in log-space
- SSM blocks claimed to be smaller than attention blocks at equivalent dimension
- SSMs can absorb the MLP, reducing block size
- Native sliding window evaluation with no KV cache
- MuonAdamW with SSM-aware parameter groups