PR #1342

open

Submission/mamba ssm byte260

by nicholasbailey87
val_bpb
1.4816
Architecture
Mamba
Optimizer
Muon
Artifact Size
~16.6-16.9MB

Training Techniques

Architecture
U-Net skip connections
Encoder-decoder style skip connections with learned skip weights across the 12-block Mamba-2 SSD model.
parameters: {"layers":12}
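A hypothetical pure-Python sketch of the learned-skip-weight idea described above: the decoder-side activation is combined with the stored encoder-side activation through a scalar weight that is learned during training. Plain lists stand in for tensors, and the block pairing is an assumption; the submission's actual wiring may differ.

```python
def apply_skip(decoder_act, encoder_act, skip_weight):
    """Combine a decoder activation with a weighted encoder skip.

    skip_weight is a learned scalar in the real model; here it is
    just a float for illustration.
    """
    return [d + skip_weight * e for d, e in zip(decoder_act, encoder_act)]

# In a 12-block U-Net layout, block i in the second half would consume
# the skip saved by block (11 - i) in the first half (assumed pairing).
out = apply_skip([1.0, 2.0], [2.0, 4.0], skip_weight=0.5)
```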
weight tying
Tied input embeddings and output logits to reduce parameter count.
parameters: null
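Weight tying can be sketched as follows: one matrix serves both as the input embedding table and, row-wise, as the output projection to logits. This is a toy pure-Python version with made-up values; a PyTorch model would share a single parameter between `nn.Embedding` and the LM head.

```python
# Shared vocab_size x d_model matrix (toy values, vocab 3, d_model 2).
EMBED = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

def embed(token_id):
    """Input side: look up the embedding row."""
    return EMBED[token_id]

def logits(hidden):
    """Output side: reuse the same rows, logit_v = <hidden, EMBED[v]>."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in EMBED]
```

Because both directions read the same matrix, the output head adds no parameters beyond the embedding table itself.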
Mamba
Modified Mamba-2 SSD architecture with an Achilles' Heel residual bypass around the convolution.
parameters: {"layers":12,"d_model":512,"d_inner":1024,"d_state":64,"d_conv":4,"headdim":64}
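The listed dimensions follow the usual Mamba-2 conventions (an assumption about this submission): `d_inner` is the expansion of `d_model`, and the number of SSD heads is `d_inner // headdim`. A quick sanity check on the config:

```python
cfg = {"layers": 12, "d_model": 512, "d_inner": 1024, "d_state": 64,
       "d_conv": 4, "headdim": 64}

expand = cfg["d_inner"] // cfg["d_model"]   # standard Mamba expansion factor
nheads = cfg["d_inner"] // cfg["headdim"]   # SSD heads per block
```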
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"matrix params"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"scalars/embeddings/SSM params"}
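The two-optimizer split above (Muon for matrix parameters, AdamW for scalars, embeddings, and SSM parameters) might be routed roughly as below. The exact rule and the parameter names are assumptions for illustration, not the submission's real parameter list.

```python
def split_params(named_shapes):
    """Route 2-D weight matrices to Muon; embeddings and 1-D/scalar
    parameters (norms, SSM A/dt terms) stay on AdamW, as is common
    when Muon is used (assumed criterion)."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Illustrative shapes only.
params = [
    ("embed.weight", (260, 512)),
    ("blocks.0.in_proj.weight", (2304, 512)),
    ("blocks.0.A_log", (16,)),
    ("blocks.0.norm.weight", (512,)),
]
muon_names, adamw_names = split_params(params)
```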
Regularization
logit softcap
parameters: {"value":30}
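Logit softcapping with value 30 is conventionally the tanh form below: logits are smoothly bounded to (-30, 30) rather than hard-clipped. A minimal sketch, assuming that standard formulation:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).

    Near zero this is approximately the identity; large logits
    saturate toward +/- cap instead of growing without bound.
    """
    return cap * math.tanh(logit / cap)
```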
Compression
zlib
level: null
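The compression level is not recorded in the PR metadata, so a sketch of zlib-compressing the artifact would fall back to the stdlib default:

```python
import zlib

def pack(raw: bytes, level: int = -1) -> bytes:
    # level=-1 selects zlib's default compression level; the
    # submission's actual level is unlisted above.
    return zlib.compress(raw, level)

def unpack(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```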
Other
other
Byte-level tokenization with vocab size 260 (4 special tokens + 256 UTF-8 bytes).
parameters: {"vocab_size":260}
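A byte-level tokenizer with vocab 260 can be sketched directly from the description: 256 raw UTF-8 byte values plus 4 special tokens. The id layout (specials first, byte `b` mapped to `b + 4`) is an assumption; the PR states only the counts.

```python
NUM_SPECIAL = 4
VOCAB_SIZE = NUM_SPECIAL + 256  # 260

def encode(text: str) -> list[int]:
    """Map each UTF-8 byte to an id after the reserved special tokens."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Invert encode, ignoring any special-token ids."""
    return bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL).decode("utf-8")
```

With so small a vocabulary, the tied embedding table is only 260 rows, which is where the embedding-overhead saving comes from.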
other
Pure-PyTorch selective scan implementation without custom CUDA kernels, relying on torch.compile.
parameters: null
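The recurrence such a pure-PyTorch selective scan vectorizes (via `torch.compile` rather than custom CUDA kernels) can be shown sequentially with scalar state. The real model carries a `d_state`-dimensional state per head; this is a minimal illustrative reduction, not the submission's implementation.

```python
def selective_scan(a, b, c, x):
    """Sequential selective scan: h_t = a_t * h_{t-1} + b_t * x_t,
    y_t = c_t * h_t, with per-step (input-dependent) coefficients."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys
```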

Novel Contributions

  • First state-space model submission to Parameter Golf
  • Modified Mamba-2 SSD with Achilles' Heel residual bypass fix
  • Byte-level tokenization with vocab size 260 to reduce embedding overhead
  • U-Net skip connections in a Mamba-based architecture
  • Pure-PyTorch selective scan implementation