PR #1268

open

Non-record: Mamba3 Hybrid + GPTQ Long Context (1.1875 BPB)

by samquiring
val_bpb
1.1875
Architecture
Hybrid
Optimizer
Artifact Size
15.51 MB

Training Techniques

Architecture
U-Net skip connections
U-Net-style skip connections across the hybrid encoder-decoder architecture.
parameters: null
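A minimal sketch of how U-Net-style skips can be wired across a block stack: activations from the first half are saved and added back, in reverse order, to the second half. The block structure here is a placeholder; the submission's actual wiring is not shown in this entry.

```python
# Sketch of U-Net-style skip connections over a stack of blocks.
# Blocks are opaque callables; only the skip wiring is the point.
def unet_forward(x, blocks):
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i < n // 2:
            saved.append(x)      # stash first-half activations
        elif saved:
            x = x + saved.pop()  # add them back in reverse order
        x = block(x)
    return x
```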
GQA
Grouped query attention layers used in the hybrid model.
parameters: {"layers":2}
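The head-sharing idea behind grouped-query attention can be sketched as a mapping from query heads to shared key/value heads; the head counts below are illustrative, since the entry only states that 2 GQA layers are used.

```python
# Sketch of GQA head sharing: each group of query heads attends using the
# same key/value head. Head counts are assumptions, not from the submission.
def kv_head_for_query_head(q_head, n_q_heads=8, n_kv_heads=2):
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```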
ReLU²
ReLU squared MLP activation in the Mamba3 hybrid blocks.
parameters: null
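The ReLU² activation named above is simply the ReLU output squared, applied elementwise inside the MLP; a one-line sketch:

```python
# ReLU-squared activation: relu(x)**2, applied elementwise.
def relu2(x):
    r = x if x > 0 else 0.0
    return r * r
```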
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
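The EMA update with the listed decay can be sketched as follows; the dict-of-floats representation is illustrative, not the submission's actual code.

```python
# Minimal sketch of EMA weight averaging: ema <- decay*ema + (1-decay)*w.
# decay=0.997 is the value from this entry's parameters.
def ema_update(ema_weights, model_weights, decay=0.997):
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w
    return ema_weights
```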
Quantization
GPTQ
parameters: {"bits":6,"scope":"all"}
Compression
lzma
parameters: {"level":9}
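Compressing the serialized artifact with LZMA at the listed level is a one-liner with Python's standard library; the round trip can be sketched as:

```python
import lzma

# Sketch: LZMA-compress serialized weights at preset 9 (the level above).
def compress_artifact(raw: bytes) -> bytes:
    return lzma.compress(raw, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```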
Evaluation
sliding window eval
parameters: null
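Sliding-window evaluation scores a long token stream in overlapping windows so every token gets long context while each token is scored exactly once. A sketch of the window bookkeeping, with illustrative window/stride values (the entry lists no parameters):

```python
# Sketch of sliding-window eval spans: each yielded (start, end, score_from)
# means "run the model on tokens[start:end], score only tokens[score_from:end]".
def sliding_windows(n_tokens, window=16384, stride=4096):
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        yield start, end, pos  # pos = first not-yet-scored token
        pos = end
```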
Sequence Length
sequence_length
train_length: 16384
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
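A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final steps; a sketch using the 4000-step warmdown from the parameters above (total steps and base LR are illustrative):

```python
# Sketch of a warmdown LR schedule: constant LR, then linear decay to zero
# over the last `warmdown_steps` steps (4000 per this entry).
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * max(0.0, frac)
```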
Regularization
logit softcap
parameters: null
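Logit softcapping squashes logits smoothly into a bounded range via a scaled tanh; a sketch, noting that the cap value here is an assumption since the entry lists no parameters:

```python
import math

# Sketch of logit softcapping: cap * tanh(logit / cap) keeps logits in
# [-cap, cap] while staying near-identity for small values.
# cap=30.0 is illustrative, not from the submission.
def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)
```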
Other
other
Full Hessian-based GPTQ post-training quantization with column reordering and percentile clip search.
parameters: {"block_size":128,"calibration_tokens":131072}
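The percentile clip search mentioned above can be sketched as trying several percentile cutoffs for the clipping range and keeping the one with the lowest round-trip error at the target bit width. This is a simplified symmetric per-tensor version, not the submission's full Hessian-based GPTQ with column reordering.

```python
# Sketch of percentile clip search for quantization range selection.
def quantize_roundtrip(xs, clip, bits=6):
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax if clip > 0 else 1.0
    out = []
    for x in xs:
        q = round(max(-clip, min(clip, x)) / scale)
        q = max(-qmax, min(qmax, q))
        out.append(q * scale)
    return out

def percentile_clip_search(xs, bits=6, percentiles=(99.0, 99.5, 99.9, 100.0)):
    mags = sorted(abs(x) for x in xs)
    best_clip, best_err = None, float("inf")
    for p in percentiles:
        idx = min(len(mags) - 1, int(p / 100.0 * len(mags)))
        clip = mags[idx]
        deq = quantize_roundtrip(xs, clip, bits)
        err = sum((a - b) ** 2 for a, b in zip(xs, deq))
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```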

Novel Contributions

  • Mamba3 hybrid architecture combining 8 Mamba3 SISO layers with 2 GQA attention layers
  • U-Net skip connections in a Mamba hybrid model
  • EMA weight averaging to improve GPTQ robustness
  • Full Hessian GPTQ post-training quantization to fit under the 16 MB limit
  • Long-context training at sequence length 16384 to exploit Mamba's linear-time scaling
  • Autoregressive self-generated calibration data for GPTQ
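One way the 8 Mamba3 + 2 GQA stack described above could be interleaved is sketched below; the actual layer placement is an assumption, not stated in the submission.

```python
# Sketch of one possible hybrid layer ordering: attention layers spaced
# evenly among the Mamba3 layers. Placement is an assumption.
def build_layer_types(n_mamba=8, n_attn=2):
    total = n_mamba + n_attn
    attn_positions = {(i + 1) * total // (n_attn + 1) for i in range(n_attn)}
    return ["attn" if i in attn_positions else "mamba" for i in range(total)]
```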