PR #220

open

[WIP] SSM LRU Baseline — First State Space Model Submission

by timothywangdev on GitHub
val_bpb
1.8480
Architecture
Linear Recurrent Unit (LRU) state space model
Optimizer
MuonAdamW
Artifact Size
16MB

Training Techniques

Architecture
LRU / state space model
Replaces transformer attention with a Linear Recurrent Unit state space model using complex diagonal recurrence.
parameters: null
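A minimal scalar sketch of what the LRU recurrence computes per channel, assuming the standard diagonal form with a complex eigenvalue of magnitude below one (the function and variable names here are illustrative, not taken from the submission):

```python
import cmath

def lru_recurrence(lam, b_seq):
    """Sequential reference for one channel of a diagonal complex recurrence:
    x_t = lam * x_{t-1} + b_t, with complex eigenvalue |lam| < 1."""
    x = 0j
    states = []
    for b in b_seq:
        x = lam * x + b
        states.append(x)
    return states

# toy example: a single decaying, rotating mode (values are illustrative)
lam = 0.9 * cmath.exp(0.5j)
states = lru_recurrence(lam, [1 + 0j, 0.5 + 0j])
```

In the full model each channel has its own eigenvalue, the inputs b_t come from a learned projection, and the output mixes the real part of the states.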
parallel scan
Uses a cumulative-sum trick in log space to compute the recurrence in parallel, intended to be torch.compile-friendly.
parameters: null
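The identity behind a log-space cumulative-sum scan, sketched for a single channel (the real kernel would replace the running sum with a single torch.cumsum over the sequence; this scalar version only demonstrates the algebra):

```python
import cmath

def lru_log_space_scan(lam, b_seq):
    """Closed-form evaluation of x_t = lam * x_{t-1} + b_t via
    x_t = lam**t * sum_{s<=t} b_s / lam**s, computed with logs:
    the inner cumulative sum is the part that parallelizes."""
    log_lam = cmath.log(lam)
    csum = 0j
    out = []
    for t, b in enumerate(b_seq, start=1):
        csum += b * cmath.exp(-t * log_lam)   # cumulative sum of b_s / lam**s
        out.append(cmath.exp(t * log_lam) * csum)
    return out
```

Working in log space keeps the powers of lam numerically manageable when |lam| is close to one and the sequence is long.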
gated projection
Applies a sigmoid gate to the SSM output.
parameters: null
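The gate itself is a one-liner; a minimal sketch, assuming the gate pre-activation comes from a learned linear projection of the block input (names are illustrative):

```python
import math

def gated(ssm_out, gate_preact):
    """Elementwise sigmoid gate on the SSM output: y_i = sigmoid(g_i) * h_i.
    In the model, gate_preact would be a learned projection of the input."""
    return [h / (1.0 + math.exp(-g)) for h, g in zip(ssm_out, gate_preact)]
```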
ReLU^2 MLP
Uses a ReLU-squared MLP similar to the transformer baseline.
parameters: null
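A plain-Python sketch of a ReLU-squared MLP, the same activation the transformer baseline reportedly uses (weights are nested lists of rows; shapes and names are illustrative):

```python
def relu2_mlp(x, w_in, w_out):
    """Minimal ReLU^2 MLP: h = relu(W_in x)**2, y = W_out h.
    Squaring the ReLU keeps the activation smooth away from zero."""
    h = [max(sum(w * xi for w, xi in zip(row, x)), 0.0) ** 2 for row in w_in]
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
```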
Optimizer
MuonAdamW
weight_decay: null
momentum: null
other_params: {"param_groups":"SSM-aware parameter groups; Adam for A/B/C/D and Muon for projections"}
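The grouping rule can be sketched as a simple routing by parameter name; the suffix convention below is an assumption for illustration, not the submission's actual naming:

```python
def split_param_groups(param_names, ssm_suffixes=(".A", ".B", ".C", ".D")):
    """SSM-aware grouping sketch: state parameters A/B/C/D go to the Adam
    group, projection weights to the Muon group. In practice each group
    would also carry its own learning rate and weight decay."""
    groups = {"adam": [], "muon": []}
    for name in param_names:
        key = "adam" if name.endswith(ssm_suffixes) else "muon"
        groups[key].append(name)
    return groups
```

Keeping the low-dimensional state parameters out of Muon's orthogonalized updates and under Adam is the usual motivation for this split.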
Evaluation
sliding window eval
parameters: {"no_recomputation":true}
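Why "no_recomputation" is natural for an SSM: the hidden state summarizes all past tokens, so evaluation can proceed chunk by chunk with the state carried across boundaries, with nothing re-read or recomputed. A scalar sketch under that assumption:

```python
def chunked_eval(lam, b_seq, chunk_len):
    """Evaluate x_t = lam * x_{t-1} + b_t in fixed-size chunks, carrying
    the hidden state across chunk boundaries. Unlike sliding-window
    attention, no earlier tokens are recomputed and no KV cache is kept."""
    x = 0j
    out = []
    for start in range(0, len(b_seq), chunk_len):
        for b in b_seq[start:start + chunk_len]:
            x = lam * x + b
            out.append(x)
    return out
```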
Other
other
Uses SSM-specific parameter grouping where A/B/C/D are optimized separately from projection layers.
parameters: null

Novel Contributions

  • First non-transformer submission to parameter golf using an LRU state space model
  • Complex diagonal recurrence with parallel scan in log-space
  • SSM blocks are claimed to be smaller than attention blocks at equivalent model dimension
  • SSMs can absorb the MLP, reducing block size
  • No KV cache; sliding-window evaluation is native to the recurrent state
  • MuonAdamW with SSM-aware parameter groups