PR #1268

open

Non-record: Mamba3 Hybrid + GPTQ Long Context (1.1875 BPB)

by samquiring
val_bpb
1.1875
Architecture
Hybrid
Optimizer
Artifact Size
15.51 MB

Training Techniques

Architecture
U-Net skip connections
U-Net-style skip connections across the hybrid encoder-decoder architecture.
parameters: null
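A minimal sketch of how U-Net-style skips can be wired across a block stack: activations from the first half are saved and added back, in reverse order, to the second half. The block structure here is a placeholder; the submission's actual wiring is not shown in this entry.

```python
# Sketch of U-Net-style skip connections over a stack of blocks.
# Blocks are opaque callables; only the skip wiring is the point.
def unet_forward(x, blocks):
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i < n // 2:
            saved.append(x)      # stash first-half activations
        elif saved:
            x = x + saved.pop()  # add them back in reverse order
        x = block(x)
    return x
```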
GQA
Grouped query attention layers used in the hybrid model.
parameters: {"layers":2}
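The head-sharing idea behind grouped-query attention can be sketched as a mapping from query heads to shared key/value heads; the head counts below are illustrative, since the entry only states that 2 GQA layers are used.

```python
# Sketch of GQA head sharing: each group of query heads attends using the
# same key/value head. Head counts are assumptions, not from the submission.
def kv_head_for_query_head(q_head, n_q_heads=8, n_kv_heads=2):
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```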
ReLU²
ReLU squared MLP activation in the Mamba3 hybrid blocks.
parameters: null
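The ReLU² activation named above is simply the ReLU output squared, applied elementwise inside the MLP; a one-line sketch:

```python
# ReLU-squared activation: relu(x)**2, applied elementwise.
def relu2(x):
    r = x if x > 0 else 0.0
    return r * r
```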
Weight Averaging
EMA
Exponential moving average of model weights.
parameters: {"decay":0.997}
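The EMA update with the listed decay can be sketched as follows; the dict-of-floats representation is illustrative, not the submission's actual code.

```python
# Minimal sketch of EMA weight averaging: ema <- decay*ema + (1-decay)*w.
# decay=0.997 is the value from this entry's parameters.
def ema_update(ema_weights, model_weights, decay=0.997):
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w
    return ema_weights
```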
Quantization
GPTQ
parameters: {"bits":6,"scope":"all"}
Compression
lzma
parameters: {"level":9}
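Compressing the serialized artifact with LZMA at the listed level is a one-liner with Python's standard library; the round trip can be sketched as:

```python
import lzma

# Sketch: LZMA-compress serialized weights at preset 9 (the level above).
def compress_artifact(raw: bytes) -> bytes:
    return lzma.compress(raw, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```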
Evaluation
sliding window eval
parameters: null
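Sliding-window evaluation scores a long token stream in overlapping windows so every token gets long context while each token is scored exactly once. A sketch of the window bookkeeping, with illustrative window/stride values (the entry lists no parameters):

```python
# Sketch of sliding-window eval spans: each yielded (start, end, score_from)
# means "run the model on tokens[start:end], score only tokens[score_from:end]".
def sliding_windows(n_tokens, window=16384, stride=4096):
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        yield start, end, pos  # pos = first not-yet-scored token
        pos = end
```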
Sequence Length
sequence_length
train_length: 16384
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
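A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final steps; a sketch using the 4000-step warmdown from the parameters above (total steps and base LR are illustrative):

```python
# Sketch of a warmdown LR schedule: constant LR, then linear decay to zero
# over the last `warmdown_steps` steps (4000 per this entry).
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * max(0.0, frac)
```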
Regularization
logit softcap
parameters: null
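Logit softcapping squashes logits smoothly into a bounded range via a scaled tanh; a sketch, noting that the cap value here is an assumption since the entry lists no parameters:

```python
import math

# Sketch of logit softcapping: cap * tanh(logit / cap) keeps logits in
# [-cap, cap] while staying near-identity for small values.
# cap=30.0 is illustrative, not from the submission.
def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)
```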
Other
other
Full Hessian-based GPTQ post-training quantization with column reordering and percentile clip search.
parameters: {"block_size":128,"calibration_tokens":131072}
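The percentile clip search mentioned above can be sketched as trying several percentile cutoffs for the clipping range and keeping the one with the lowest round-trip error at the target bit width. This is a simplified symmetric per-tensor version, not the submission's full Hessian-based GPTQ with column reordering.

```python
# Sketch of percentile clip search for quantization range selection.
def quantize_roundtrip(xs, clip, bits=6):
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax if clip > 0 else 1.0
    out = []
    for x in xs:
        q = round(max(-clip, min(clip, x)) / scale)
        q = max(-qmax, min(qmax, q))
        out.append(q * scale)
    return out

def percentile_clip_search(xs, bits=6, percentiles=(99.0, 99.5, 99.9, 100.0)):
    mags = sorted(abs(x) for x in xs)
    best_clip, best_err = None, float("inf")
    for p in percentiles:
        idx = min(len(mags) - 1, int(p / 100.0 * len(mags)))
        clip = mags[idx]
        deq = quantize_roundtrip(xs, clip, bits)
        err = sum((a - b) ** 2 for a, b in zip(xs, deq))
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```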

Novel Contributions

  • Mamba3 hybrid architecture combining 8 Mamba3 SISO layers with 2 GQA attention layers
  • U-Net skip connections in a Mamba hybrid model
  • EMA weight averaging to improve GPTQ robustness
  • Full Hessian GPTQ post-training quantization to fit under the 16 MB limit
  • Long-context training at sequence length 16384 to exploit Mamba's linear-time scaling
  • Autoregressive self-generated calibration data for GPTQ
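One way the 8 Mamba3 + 2 GQA stack described above could be interleaved is sketched below; the actual layer placement is an assumption, not stated in the submission.

```python
# Sketch of one possible hybrid layer ordering: attention layers spaced
# evenly among the Mamba3 layers. Placement is an assumption.
def build_layer_types(n_mamba=8, n_attn=2):
    total = n_mamba + n_attn
    attn_positions = {(i + 1) * total // (n_attn + 1) for i in range(n_attn)}
    return ["attn" if i in attn_positions else "mamba" for i in range(total)]
```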