PR #1342

open

Submission/mamba ssm byte260

by nicholasbailey87
val_bpb
1.4816
Architecture
Mamba
Optimizer
Muon
Artifact Size
~16.6-16.9MB

Training Techniques

Architecture
U-Net skip connections
Encoder-decoder style skip connections with learned skip weights across the 12-block Mamba-2 SSD model.
parameters: {"layers":12}
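A hypothetical pure-Python sketch of the learned-skip-weight idea described above: the decoder-side activation is combined with the stored encoder-side activation through a scalar weight that is learned during training. Plain lists stand in for tensors, and the block pairing is an assumption; the submission's actual wiring may differ.

```python
def apply_skip(decoder_act, encoder_act, skip_weight):
    """Combine a decoder activation with a weighted encoder skip.

    skip_weight is a learned scalar in the real model; here it is
    just a float for illustration.
    """
    return [d + skip_weight * e for d, e in zip(decoder_act, encoder_act)]

# In a 12-block U-Net layout, block i in the second half would consume
# the skip saved by block (11 - i) in the first half (assumed pairing).
out = apply_skip([1.0, 2.0], [2.0, 4.0], skip_weight=0.5)
```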
weight tying
Tied input embeddings and output logits to reduce parameter count.
parameters: null
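Weight tying can be sketched as follows: one matrix serves both as the input embedding table and, row-wise, as the output projection to logits. This is a toy pure-Python version with made-up values; a PyTorch model would share a single parameter between `nn.Embedding` and the LM head.

```python
# Shared vocab_size x d_model matrix (toy values, vocab 3, d_model 2).
EMBED = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

def embed(token_id):
    """Input side: look up the embedding row."""
    return EMBED[token_id]

def logits(hidden):
    """Output side: reuse the same rows, logit_v = <hidden, EMBED[v]>."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in EMBED]
```

Because both directions read the same matrix, the output head adds no parameters beyond the embedding table itself.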
Mamba
Modified Mamba-2 SSD architecture with an Achilles' Heel residual bypass around the convolution.
parameters: {"layers":12,"d_model":512,"d_inner":1024,"d_state":64,"d_conv":4,"headdim":64}
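The listed dimensions follow the usual Mamba-2 conventions (an assumption about this submission): `d_inner` is the expansion of `d_model`, and the number of SSD heads is `d_inner // headdim`. A quick sanity check on the config:

```python
cfg = {"layers": 12, "d_model": 512, "d_inner": 1024, "d_state": 64,
       "d_conv": 4, "headdim": 64}

expand = cfg["d_inner"] // cfg["d_model"]   # standard Mamba expansion factor
nheads = cfg["d_inner"] // cfg["headdim"]   # SSD heads per block
```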
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix params"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"scalars/embeddings/SSM params"}
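The two-optimizer split above (Muon for matrix parameters, AdamW for scalars, embeddings, and SSM parameters) might be routed roughly as below. The exact rule and the parameter names are assumptions for illustration, not the submission's real parameter list.

```python
def split_params(named_shapes):
    """Route 2-D weight matrices to Muon; embeddings and 1-D/scalar
    parameters (norms, SSM A/dt terms) stay on AdamW, as is common
    when Muon is used (assumed criterion)."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Illustrative shapes only.
params = [
    ("embed.weight", (260, 512)),
    ("blocks.0.in_proj.weight", (2304, 512)),
    ("blocks.0.A_log", (16,)),
    ("blocks.0.norm.weight", (512,)),
]
muon_names, adamw_names = split_params(params)
```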
Regularization
logit softcap
parameters: {"value":30}
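Logit softcapping with value 30 is conventionally the tanh form below: logits are smoothly bounded to (-30, 30) rather than hard-clipped. A minimal sketch, assuming that standard formulation:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).

    Near zero this is approximately the identity; large logits
    saturate toward +/- cap instead of growing without bound.
    """
    return cap * math.tanh(logit / cap)
```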
Compression
zlib
level: null
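The compression level is not recorded in the PR metadata, so a sketch of zlib-compressing the artifact would fall back to the stdlib default:

```python
import zlib

def pack(raw: bytes, level: int = -1) -> bytes:
    # level=-1 selects zlib's default compression level; the
    # submission's actual level is unlisted above.
    return zlib.compress(raw, level)

def unpack(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```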
Other
other
Byte-level tokenization with vocab size 260 (4 special tokens + 256 UTF-8 bytes).
parameters: {"vocab_size":260}
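A byte-level tokenizer with vocab 260 can be sketched directly from the description: 256 raw UTF-8 byte values plus 4 special tokens. The id layout (specials first, byte `b` mapped to `b + 4`) is an assumption; the PR states only the counts.

```python
NUM_SPECIAL = 4
VOCAB_SIZE = NUM_SPECIAL + 256  # 260

def encode(text: str) -> list[int]:
    """Map each UTF-8 byte to an id after the reserved special tokens."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Invert encode, ignoring any special-token ids."""
    return bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL).decode("utf-8")
```

With so small a vocabulary, the tied embedding table is only 260 rows, which is where the embedding-overhead saving comes from.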
other
Pure-PyTorch selective scan implementation without custom CUDA kernels, relying on torch.compile.
parameters: null
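The recurrence such a pure-PyTorch selective scan vectorizes (via `torch.compile` rather than custom CUDA kernels) can be shown sequentially with scalar state. The real model carries a `d_state`-dimensional state per head; this is a minimal illustrative reduction, not the submission's implementation.

```python
def selective_scan(a, b, c, x):
    """Sequential selective scan: h_t = a_t * h_{t-1} + b_t * x_t,
    y_t = c_t * h_t, with per-step (input-dependent) coefficients."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys
```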

Novel Contributions

  • First state-space model submission to Parameter Golf
  • Modified Mamba-2 SSD with Achilles' Heel residual bypass fix
  • Byte-level tokenization with vocab size 260 to reduce embedding overhead
  • U-Net skip connections in a Mamba-based architecture
  • Pure-PyTorch selective scan implementation