PR #1994
Non-record: First SSM entry — kill-Mamba-2 + Ternary + n=7 (1.30040)
by potatonyliu
val_bpb: 1.3004
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 12.08 MB
Training Techniques
Architecture
depth recurrence
7 unique layers reused across 3 loops for effective depth 21.
parameters: {"layers":7,"loops":3,"effective_depth":21}
weight tying
Tied input/output embeddings.
parameters: null
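A minimal sketch of the tying, assuming standard input/output projections (sizes illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 50304, 256             # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight                # one shared tensor serves both ends
```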
GQA
Grouped-query attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
RoPE
Rotary positional embeddings used in attention.
parameters: null
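A self-contained RoPE sketch; the base frequency and channel-pairing convention are the common defaults, not confirmed by the PR:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles
    _, _, T, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()          # (T, D/2), broadcast over batch and heads
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```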
Hybrid
Parallel attention and SSM branches in each block.
parameters: null
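A structural sketch of one hybrid block; the stand-in modules, and the assumption that the two branch outputs are summed into the residual, are mine rather than the PR's:

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)  # stand-in for GQA
        self.ssm = nn.Linear(d_model, d_model)                           # stand-in for kill-Mamba-2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.ssm(h)           # parallel branches join the residual stream
```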
Mamba
kill-Mamba-2 SSM variant with selectivity removed, hence linear time-invariant (LTI): the input-dependent dt/B/C projections of Mamba-2 are replaced by learned constants, while the conv1d and gated SSD scan are retained.
parameters: {"d_state":64,"expand":2,"chunk_size":64,"headdim":64}
Quantization
ternary
bits: 2
scope: body weights
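A minimal sketch of BitNet-style absmean ternarization plus the packed 2-bit export; the exact rounding rule, scale granularity, and packing order in the submission may differ:

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-5):
    # Absmean scaling: w ~ scale * q with q in {-1, 0, 1}
    scale = w.abs().mean().clamp(min=eps)
    q = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return q, scale

def pack_2bit(q: torch.Tensor) -> torch.Tensor:
    # Map {-1, 0, 1} -> {0, 1, 2}, then pack four 2-bit codes per byte
    u = (q + 1).to(torch.uint8).flatten()
    u = torch.cat([u, u.new_zeros((-u.numel()) % 4)])   # pad to a multiple of 4
    u = u.view(-1, 4)
    return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)

q, scale = ternarize(torch.randn(256, 256))
packed = pack_2bit(q)                                   # 4x smaller than int8 storage
```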
Weight Averaging
EMA
parameters: {"decay":0.999}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.045,"backend_steps":15,"adamw_used":true}
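A sketch of the split implied by other_params, assuming the modded-nanogpt Muon class; the import path, the parameter-partition rule, and the AdamW learning rate are all assumptions:

```python
import torch
from muon import Muon                        # assumed import; see modded-nanogpt

hidden_matrices, others = [], []
for name, p in model.named_parameters():    # `model` defined elsewhere
    if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
        hidden_matrices.append(p)            # 2-D hidden weights get orthogonalized updates
    else:
        others.append(p)                     # embeddings, head, gains/biases go to AdamW

opt_muon = Muon(hidden_matrices, lr=0.045, backend_steps=15)
opt_adamw = torch.optim.AdamW(others, lr=3e-3)            # this lr is a guess

for opt in (opt_muon, opt_adamw):            # step both each iteration
    opt.step()
```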
Compression
brotli
level: 11
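The export step in miniature; the artifact filename is hypothetical:

```python
import brotli

payload = open("model_ternary.bin", "rb").read()    # hypothetical packed-weights file
blob = brotli.compress(payload, quality=11)         # quality 11: densest (and slowest)
open("model_ternary.bin.br", "wb").write(blob)
print(f"{len(payload)} -> {len(blob)} bytes")
```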
LR Schedule
warmdown
parameters: {"warmdown_iters":1800,"lr_warmup_steps":30}
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- First SSM-based entry in either track
- kill-Mamba-2 linear-time-invariant SSM variant
- Parallel attention and SSM within each block
- Depth recurrence with 7 shared layers over 3 loops
- BitNet-style ternary body-weight quantization with packed 2-bit export
- EMA shadow-weight swap before final evaluation
- Muon + AdamW optimizer split
- Single-file submission with inlined quantization helpers