val_bpb
1.5633
Architecture
Hybrid
Optimizer
Muon
Artifact Size
10.9 MB
Training Techniques
Architecture
Mamba
Hybrid model using 7 Mamba-3 SISO SSD blocks and 1 attention layer in an 8-layer stack
parameters: {"layers":8,"mamba_layers":7,"attention_layers":1,"dim":512,"d_state":64,"mlp_mult":3,"seq_len":4096}
GQA
Causal grouped-query attention with 8 heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
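A minimal sketch of the head sharing implied by these numbers: with 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. The contiguous grouping and the helper name `kv_head_for` are assumptions for illustration; the card does not say how heads are grouped.

```python
# Hypothetical helper: which KV head does each query head read from?
HEADS = 8      # query heads, from the card
KV_HEADS = 4   # key/value heads, from the card


def kv_head_for(query_head: int, heads: int = HEADS, kv_heads: int = KV_HEADS) -> int:
    """Return the KV head index shared by `query_head` (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size


mapping = [kv_head_for(q) for q in range(HEADS)]
print(mapping)
```

With this grouping the KV projections are half the size of the full multi-head equivalent, since only 4 key and 4 value heads are stored.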
RoPE
Rotary positional embeddings used in attention
parameters: null
LeakyReLU
LeakyReLU squared hidden activation in the MLP
parameters: null
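As a rough illustration, "LeakyReLU squared" can be read as applying LeakyReLU and then squaring the result elementwise. The negative slope of 0.5 below is an assumption; the card lists no parameters for this activation.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring (the slope value is assumed, not from the card)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


# Squaring discards the sign, so negative inputs give small positive outputs.
print(leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```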
U-Net skip connections
U-Net style skip connections included in the hybrid architecture
parameters: null
SmearGate
SmearGate component used in the model
parameters: null
BigramHash
BigramHash component used in the model
parameters: null
weight tying
Tied input and output embeddings
parameters: null
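A toy sketch of what tying means in practice: the same table is used to look up token vectors and to produce output logits, so no separate output matrix needs to be stored. The class name `TiedEmbedding` and the tiny table are hypothetical.

```python
class TiedEmbedding:
    """One weight table shared by the input lookup and the output projection."""

    def __init__(self, table):
        self.table = table  # list of vocab rows, each a list of `dim` floats

    def embed(self, token_id):
        # Input side: row lookup.
        return self.table[token_id]

    def logits(self, hidden):
        # Output side: dot product of the hidden state with every row of the same table.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.table]


emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = emb.logits(emb.embed(2))
print(out)
```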
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
Quantization
QAT
bits: 6
scope: Mamba projections and standard CastedLinear layers
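A sketch of the forward-pass rounding that 6-bit QAT typically simulates, assuming symmetric quantization with one scale per weight row and integer levels in [-31, 31]. The scaling scheme and the function name `fake_quantize_row` are assumptions; the card states only the bit width and scope. During training, gradients would usually pass through the rounding with a straight-through estimator, which this plain-Python sketch does not model.

```python
def fake_quantize_row(row, bits: int = 6):
    """Round a weight row onto a symmetric integer grid and map it back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)  # an all-zero row needs no scale
    scale = max_abs / qmax  # assumed per-row scale
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


row = [1.0, -0.5, 0.26, 0.0]
quantized = fake_quantize_row(row)
print(quantized)
```

Each dequantized value lies within half a quantization step of the original weight, which is the error the model learns to tolerate during QAT.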
Compression
zlib
level: 9
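For reference, level 9 is zlib's maximum compression setting. A minimal round trip with Python's standard `zlib` module looks like the following; the payload is placeholder bytes rather than real model weights, and `compress_artifact` is a hypothetical name.

```python
import zlib


def compress_artifact(payload: bytes) -> bytes:
    """Compress serialized bytes at zlib level 9 (slowest, smallest output)."""
    return zlib.compress(payload, 9)


payload = bytes(range(64)) * 1024  # placeholder standing in for serialized weights
blob = compress_artifact(payload)
restored = zlib.decompress(blob)
print(len(payload), len(blob))
```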
Test-Time Training
full TTT
parameters: {"epochs":1}
LR Schedule
warmdown
parameters: {"warmdown_iters":22000}
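One way a warmdown expressed in iterations becomes hardware-dependent under a wall-clock budget is sketched below: the warmdown window is converted to seconds using the measured step time, and the learning-rate multiplier decays linearly over the remaining time. This is an illustrative assumption and not the submission's actual schedule; the function name, the linear shape, and the 600-second budget in the usage lines are all hypothetical.

```python
def warmdown_multiplier(elapsed_s: float, budget_s: float, step_time_s: float,
                        warmdown_iters: int = 22000) -> float:
    """Linear LR multiplier over a warmdown window sized in steps but spent in seconds."""
    warmdown_s = warmdown_iters * step_time_s  # window length depends on hardware speed
    remaining_s = max(budget_s - elapsed_s, 0.0)
    if warmdown_s <= 0.0 or remaining_s >= warmdown_s:
        return 1.0  # still before the warmdown window
    return remaining_s / warmdown_s


# Hypothetical 600 s budget: fast steps leave a flat phase, slow steps do not.
fast = warmdown_multiplier(elapsed_s=0.0, budget_s=600.0, step_time_s=0.01)
slow = warmdown_multiplier(elapsed_s=0.0, budget_s=600.0, step_time_s=0.1)
print(fast, slow)
```

Under these assumptions, the same `warmdown_iters` value yields a different schedule shape on slower hardware, because the window in seconds can exceed the whole budget.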
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Mamba-3 SISO SSD + attention hybrid architecture for parameter golf
- QAT applied to Mamba-3 projections by replacing nn.Linear projections with CastedLinear
- Fix for the hardware-dependent warmdown schedule, based on step time
- Demonstration that fewer attention layers and smaller MLPs win under a fixed wall-clock budget
- GLU values in attention provided the largest ablation gain