PR #1890 (open)
Non-record: Mamba-3 Hybrid + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean)
by mradassaad
val_bpb: 1.1456
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.93 MB
Training Techniques

Test-Time Training
- full TTT (parameters: {"epochs": 2}); see the sketch below
Quantization
- mixed int6/int8 (bits: 6; scope: all weights, with SSM dynamics rows kept at int8)
- GPTQ (bits: 6; scope: weights)
- late QAT (bits: null; scope: post-quant adaptation)
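
A sketch of the dynamics-protected mixed-precision idea: symmetric per-row quantization where protected rows (the SSM dynamics rows, per the card) get int8 and everything else gets int6. How `protected` is derived from the dd_A / dd_dt rows is an assumption; the quantizer shown is a generic scheme of mine, not the PR's implementation.

```python
import torch

def quantize_rows(w: torch.Tensor, protected: torch.Tensor):
    """Symmetric per-row quantization: protected rows -> int8, others -> int6."""
    bits = protected.long() * 2 + 6                   # 8 where protected, else 6
    qmax = 2 ** (bits - 1) - 1                        # 127 for int8, 31 for int6
    scale = w.abs().amax(dim=1) / qmax                # per-row symmetric scale
    scale = torch.maximum(scale, 1.0 / qmax)          # 1/qmax scale floor (see fix below)
    q = torch.clamp(torch.round(w / scale[:, None]), -qmax[:, None], qmax[:, None])
    return q.to(torch.int8), scale                    # int6 values fit in int8 storage
```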
Regularization
- weight decay (parameters: {"weight_decay": 0.04})
Optimizer
- Muon (weight_decay: 0.04; momentum: 0.99; other_params: {"matrix_lr": 0.025, "muon_eq_r": 1})
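
For context, a rough sketch of Muon's core update (momentum followed by Newton-Schulz orthogonalization), after the public reference implementation. The hyperparameters mirror this card; Nesterov momentum, per-shape LR scaling, and the submission-specific `muon_eq_r` are omitted, so treat this as illustrative rather than the PR's code.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16() / (G.norm() + eps)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(p, buf, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One simplified Muon step for a 2D weight `p` with momentum buffer `buf`."""
    buf.mul_(momentum).add_(p.grad)      # momentum accumulation
    update = newton_schulz(buf)          # orthogonalized update direction
    p.mul_(1 - lr * weight_decay)        # decoupled weight decay
    p.add_(update, alpha=-lr)
```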
Evaluation
- sliding window eval (parameters: {"overlap": 1024})
Compression
- lzma (level: null)
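
Compressing the artifact with Python's standard `lzma` module could look like the sketch below. The card lists `level: null`, so the default preset is used; the file names are placeholders.

```python
import lzma

def compress_artifact(src="artifact.bin", dst="artifact.bin.xz"):
    """Write an LZMA-compressed copy of the artifact using the default preset."""
    with open(src, "rb") as f, lzma.open(dst, "wb") as out:
        out.write(f.read())
```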
Sequence Length
- train_length: 4096
- eval_length: null
LR Schedule
- warmdown (parameters: {"warmdown_iters": 2600, "shape": "linear"})
Architecture
- Hybrid: 7-layer Mamba-3/Attention hybrid with 5 SSM blocks and 2 FlashAttention layers at positions 2 and 5
- parameters: {"layers": 7, "ssm_blocks": 5, "attention_layers": 2, "dim": 512, "d_state": 64, "expand": 2, "headdim": 64, "chunk_size": 64, "mlp_mult": 3}
Novel Contributions
- Multi-epoch TTT (TTT_EPOCHS=2) to reduce post-quant regression
- Mixed-precision SSM dynamics protection: dd_A and dd_dt rows are quantized at int8 while all other rows stay at int6
- Fix for a scale-floor quantization bug so that int6 rows use the correct 1/qmax floor (see the sketch after this list)
- LZMA-based artifact compression for the final submission
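
A sketch of the scale-floor fix referenced above. With a symmetric scheme, `scale = max|w| / qmax`, and flooring the scale at `1/qmax` keeps near-zero rows representable without degenerate scales; the bug, as described, was the floor not tracking the int6 qmax (31 rather than 127). The exact original form of the bug is an assumption.

```python
def row_scale(max_abs: float, bits: int) -> float:
    """Per-row symmetric scale with the bit-width-correct 1/qmax floor."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    return max(max_abs / qmax, 1.0 / qmax)
```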