PR #1994
Non-record: First SSM entry — kill-Mamba-2 + Ternary + n=7 (1.30040)
by potatonyliu
val_bpb: 1.3004
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 12.08 MB
Training Techniques
Architecture
depth recurrence
7 unique layers reused across 3 loops for effective depth 21.
parameters: {"layers":7,"loops":3,"effective_depth":21}
weight tying
Tied input/output embeddings.
parameters: null
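A minimal sketch of the tying, assuming standard input/output projections (sizes illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 50304, 256             # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight                # one shared tensor serves both ends
```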
GQA
Grouped-query attention with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
RoPE
Rotary positional embeddings used in attention.
parameters: null
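A self-contained RoPE sketch; the base frequency and channel-pairing convention are the common defaults, not confirmed by the PR:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles
    _, _, T, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()          # (T, D/2), broadcast over batch and heads
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```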
Hybrid
Parallel attention and SSM branches in each block.
parameters: null
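A structural sketch of one hybrid block; the stand-in modules, and the assumption that the two branch outputs are summed into the residual, are mine rather than the PR's:

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)  # stand-in for GQA
        self.ssm = nn.Linear(d_model, d_model)                           # stand-in for kill-Mamba-2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.ssm(h)           # parallel branches join the residual stream
```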
Mamba
kill-Mamba-2 SSM variant with selectivity removed, hence linear time-invariant (LTI): the input-dependent dt/B/C projections of Mamba-2 are replaced by learned constants, while the conv1d and gated SSD scan are retained.
parameters: {"d_state":64,"expand":2,"chunk_size":64,"headdim":64}
Quantization
ternary
bits: 2
scope: body weights
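A minimal sketch of BitNet-style absmean ternarization plus the packed 2-bit export; the exact rounding rule, scale granularity, and packing order in the submission may differ:

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-5):
    # Absmean scaling: w ~ scale * q with q in {-1, 0, 1}
    scale = w.abs().mean().clamp(min=eps)
    q = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return q, scale

def pack_2bit(q: torch.Tensor) -> torch.Tensor:
    # Map {-1, 0, 1} -> {0, 1, 2}, then pack four 2-bit codes per byte
    u = (q + 1).to(torch.uint8).flatten()
    u = torch.cat([u, u.new_zeros((-u.numel()) % 4)])   # pad to a multiple of 4
    u = u.view(-1, 4)
    return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)

q, scale = ternarize(torch.randn(256, 256))
packed = pack_2bit(q)                                   # 4x smaller than int8 storage
```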
Weight Averaging
EMA
parameters: {"decay":0.999}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.045,"backend_steps":15,"adamw_used":true}
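A sketch of the split implied by other_params, assuming the modded-nanogpt Muon class; the import path, the parameter-partition rule, and the AdamW learning rate are all assumptions:

```python
import torch
from muon import Muon                        # assumed import; see modded-nanogpt

hidden_matrices, others = [], []
for name, p in model.named_parameters():    # `model` defined elsewhere
    if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
        hidden_matrices.append(p)            # 2-D hidden weights get orthogonalized updates
    else:
        others.append(p)                     # embeddings, head, gains/biases go to AdamW

opt_muon = Muon(hidden_matrices, lr=0.045, backend_steps=15)
opt_adamw = torch.optim.AdamW(others, lr=3e-3)            # this lr is a guess

for opt in (opt_muon, opt_adamw):            # step both each iteration
    opt.step()
```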
Compression
brotli
level: 11
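The export step in miniature; the artifact filename is hypothetical:

```python
import brotli

payload = open("model_ternary.bin", "rb").read()    # hypothetical packed-weights file
blob = brotli.compress(payload, quality=11)         # quality 11: densest (and slowest)
open("model_ternary.bin.br", "wb").write(blob)
print(f"{len(payload)} -> {len(blob)} bytes")
```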
LR Schedule
warmdown
parameters: {"warmdown_iters":1800,"lr_warmup_steps":30}
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- First SSM-based entry in either track
- kill-Mamba-2 linear-time-invariant SSM variant
- Parallel attention and SSM within each block
- Depth recurrence with 7 shared layers over 3 loops
- BitNet-style ternary body-weight quantization with packed 2-bit export
- EMA shadow-weight swap before final evaluation
- Muon + AdamW optimizer split
- Single-file submission with inlined quantization helpers