PR #1890 (open)
Non-record: Mamba-3 Hybrid + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean)
by mradassaad
val_bpb: 1.1456
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.93 MB
Training Techniques

Test-Time Training
- full TTT (parameters: {"epochs": 2}); see the sketch below
Quantization
- mixed int6/int8 (bits: 6; scope: all weights, with SSM dynamics rows kept at int8)
- GPTQ (bits: 6; scope: weights)
- late QAT (bits: null; scope: post-quant adaptation)
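
A sketch of the dynamics-protected mixed-precision idea: symmetric per-row quantization where protected rows (the SSM dynamics rows, per the card) get int8 and everything else gets int6. How `protected` is derived from the dd_A / dd_dt rows is an assumption; the quantizer shown is a generic scheme of mine, not the PR's implementation.

```python
import torch

def quantize_rows(w: torch.Tensor, protected: torch.Tensor):
    """Symmetric per-row quantization: protected rows -> int8, others -> int6."""
    bits = protected.long() * 2 + 6                   # 8 where protected, else 6
    qmax = 2 ** (bits - 1) - 1                        # 127 for int8, 31 for int6
    scale = w.abs().amax(dim=1) / qmax                # per-row symmetric scale
    scale = torch.maximum(scale, 1.0 / qmax)          # 1/qmax scale floor (see fix below)
    q = torch.clamp(torch.round(w / scale[:, None]), -qmax[:, None], qmax[:, None])
    return q.to(torch.int8), scale                    # int6 values fit in int8 storage
```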
Regularization
- weight decay (parameters: {"weight_decay": 0.04})
Optimizer
- Muon (weight_decay: 0.04; momentum: 0.99; other_params: {"matrix_lr": 0.025, "muon_eq_r": 1})
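
For context, a rough sketch of Muon's core update (momentum followed by Newton-Schulz orthogonalization), after the public reference implementation. The hyperparameters mirror this card; Nesterov momentum, per-shape LR scaling, and the submission-specific `muon_eq_r` are omitted, so treat this as illustrative rather than the PR's code.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16() / (G.norm() + eps)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(p, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One simplified Muon step for a 2D weight `p` with momentum buffer `buf`."""
    buf.mul_(momentum).add_(p.grad)      # momentum accumulation
    update = newton_schulz(buf)          # orthogonalized update direction
    p.mul_(1 - lr * weight_decay)        # decoupled weight decay
    p.add_(update, alpha=-lr)
```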
Evaluation
- sliding window eval (parameters: {"overlap": 1024})
Compression
- lzma (level: null)
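
Compressing the artifact with Python's standard `lzma` module could look like the sketch below. The card lists `level: null`, so the default preset is used; the file names are placeholders.

```python
import lzma

def compress_artifact(src="artifact.bin", dst="artifact.bin.xz"):
    """Write an LZMA-compressed copy of the artifact using the default preset."""
    with open(src, "rb") as f, lzma.open(dst, "wb") as out:
        out.write(f.read())
```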
Sequence Length
- train_length: 4096
- eval_length: null
LR Schedule
- warmdown (parameters: {"warmdown_iters": 2600, "shape": "linear"})
Architecture
- Hybrid: 7-layer Mamba-3/Attention hybrid with 5 SSM blocks and 2 FlashAttention layers at positions 2 and 5
- parameters: {"layers": 7, "ssm_blocks": 5, "attention_layers": 2, "dim": 512, "d_state": 64, "expand": 2, "headdim": 64, "chunk_size": 64, "mlp_mult": 3}
Novel Contributions
- Multi-epoch TTT (TTT_EPOCHS=2) to reduce post-quant regression
- Mixed-precision SSM dynamics protection: dd_A and dd_dt rows are quantized at int8 while all other rows stay at int6
- Fix for a scale-floor quantization bug so that int6 rows use the correct 1/qmax floor (see the sketch after this list)
- LZMA-based artifact compression for the final submission
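
A sketch of the scale-floor fix referenced above. With a symmetric scheme, `scale = max|w| / qmax`, and flooring the scale at `1/qmax` keeps near-zero rows representable without degenerate scales; the bug, as described, was the floor not tracking the int6 qmax (31 rather than 127). The exact original form of the bug is an assumption.

```python
def row_scale(max_abs: float, bits: int) -> float:
    """Per-row symmetric scale with the bit-width-correct 1/qmax floor."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    return max(max_abs / qmax, 1.0 / qmax)
```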