PR #1355
Non-record: Mamba-3 Hybrid + Full Hessian GPTQ + Late QAT — val_bpb 1.1526
by mradassaad
val_bpb: 1.1526
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.78MB
Training Techniques
Architecture
Mamba
Seven Mamba-3 SISO SSD layers in an 8-layer hybrid model (the remaining layer is attention).
parameters: {"layers":7,"dim":512,"d_state":64,"seq_len":4096}
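At the core of each SSD block is a linear state-space recurrence. A minimal scalar (SISO) sketch of that data flow, purely illustrative: the actual Mamba-3 block carries a d_state=64 state per channel, makes the coefficients input-dependent, and runs as a fused parallel kernel, none of which is shown here.

```python
# Toy single-input single-output (SISO) state-space scan:
#   h_t = a*h_{t-1} + b*x_t   (state update: decay a, input gain b)
#   y_t = c*h_t               (readout)
def ssm_scan(xs, a, b, c):
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

With a = 0.5, an impulse input decays geometrically through the state, which is the memory mechanism the selective variants then modulate per token.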
attention
A single attention layer inserted into the Mamba stack.
parameters: {"layers":1,"position":4,"heads":"8/4"}
weight tying
Input and output embeddings are tied.
parameters: null
U-Net skip connections
U-Net style skip connections in the hybrid architecture.
parameters: null
BigramHash
Bigram hash feature used in the model.
parameters: null
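A hedged sketch of what a bigram-hash feature typically looks like: each (previous token, current token) pair is hashed into a fixed number of buckets that index a learned embedding table, giving the model a direct lookup for common two-token contexts. The bucket count and mixing constant below are illustrative assumptions, not the PR's actual values.

```python
# Hypothetical bigram-hash bucketing (constants are illustrative).
def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = 1 << 16) -> int:
    h = (prev_tok * 1000003) ^ tok   # cheap multiplicative mix of the pair
    return h % n_buckets             # index into a learned bucket table
```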
SmearGate
SmearGate component included in the architecture.
parameters: null
LeakyReLU
Squared LeakyReLU used as the MLP activation.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
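Muon's distinguishing step is orthogonalizing each momentum-accumulated matrix update via a Newton-Schulz iteration before applying it. A pure-Python sketch using the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X; the production optimizer uses a tuned quintic polynomial and runs on GPU, so this is only the idea, not the implementation.

```python
# List-based matrix helpers (illustration only; real Muon uses GPU tensors).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(M, steps=20):
    # Scale by the Frobenius norm so all singular values are <= 1,
    # then apply the cubic Newton-Schulz map, which drives them to 1.
    fro = sum(v * v for row in M for v in row) ** 0.5
    X = [[v / fro for v in row] for row in M]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```

The result is (approximately) the nearest orthogonal matrix to the update's direction, which equalizes the scale of the update across singular directions.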
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: all
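Both GPTQ and the late QAT phase target the same 6-bit grid. A hedged sketch of symmetric round-to-nearest quantization onto that grid: GPTQ goes further by quantizing weight columns one at a time and using the full per-layer inverse input Hessian to push each rounding error onto the not-yet-quantized columns, bookkeeping that is omitted here.

```python
# Symmetric fake-quantization to a 2^(bits-1)-1 grid (a simplification;
# not the PR's GPTQ code, which is Hessian-aware).
def fake_quant(ws, bits=6):
    qmax = 2 ** (bits - 1) - 1               # 31 levels each side for 6 bits
    scale = max(abs(w) for w in ws) / qmax   # per-group scale from max weight
    return [round(w / scale) * scale for w in ws]
```

In QAT this rounding is applied in the forward pass with a straight-through gradient, so the weights learn to sit near grid points before the final conversion.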
Compression
lzma
level: null
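The checkpoint is LZMA-compressed to fit under the 16MB artifact cap. A minimal sketch with Python's stdlib `lzma` module; the preset shown is an assumption, since the PR's exact settings are not specified above.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # preset=9|PRESET_EXTREME trades compression time for ratio (assumed setting)
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def load_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```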
Evaluation
sliding window eval
parameters: {"stride":32}
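Sliding-window evaluation scores the sequence in overlapping windows: the context slides forward by a small stride (here 32) and only the last `stride` tokens of each window are scored, so nearly every token is predicted with close-to-full context. A sketch that just enumerates the window and scored-token index spans (in practice the first window would score all of its tokens; that special case is omitted for brevity):

```python
# Enumerate (window_start, window_end, score_start, score_end) index spans.
def sliding_windows(n_tokens, window=4096, stride=32):
    spans = []
    for start in range(0, n_tokens - window + 1, stride):
        end = start + window
        spans.append((start, end, end - stride, end))
    return spans
```

This is far more expensive than disjoint-chunk evaluation (one forward pass per 32 tokens instead of per 4096), but it removes the context cliff at chunk boundaries.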
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"shape":"linear"}
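The linear warmdown holds the base learning rate until the final 3500 steps, then decays it linearly to zero; per the contributions below, it is this low-LR tail that lets late QAT activate properly. A sketch of the schedule (the total step count and base LR are placeholders, not values stated in the PR):

```python
# Constant LR, then linear decay to zero over the last `warmdown_steps`.
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```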
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Full Hessian GPTQ replacing QAT-only quantization
- Late QAT triggered only in the final low-learning-rate training phase
- Linear warmdown schedule that enables late QAT to activate properly
- Mamba-3 hybrid architecture with one attention layer
- LZMA compression to fit the artifact under the 16MB limit
- Closing the quantization gap from 174 mBPB to effectively zero