PR #1355
Non-record: Mamba-3 Hybrid + Full Hessian GPTQ + Late QAT — val_bpb 1.1526
by mradassaad
val_bpb: 1.1526
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.78MB
Training Techniques
Architecture
Mamba
Seven Mamba-3 SISO SSD layers in an 8-layer hybrid model (the remaining layer is attention).
parameters: {"layers":7,"dim":512,"d_state":64,"seq_len":4096}
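At the core of each SSD block is a linear state-space recurrence. A minimal scalar (SISO) sketch of that data flow, purely illustrative: the actual Mamba-3 block carries a d_state=64 state per channel, makes the coefficients input-dependent, and runs as a fused parallel kernel, none of which is shown here.

```python
# Toy single-input single-output (SISO) state-space scan:
#   h_t = a*h_{t-1} + b*x_t   (state update: decay a, input gain b)
#   y_t = c*h_t               (readout)
def ssm_scan(xs, a, b, c):
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

With a = 0.5, an impulse input decays geometrically through the state, which is the memory mechanism the selective variants then modulate per token.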
attention
A single attention layer inserted into the Mamba stack.
parameters: {"layers":1,"position":4,"heads":"8/4"}
weight tying
Input and output embeddings are tied.
parameters: null
U-Net skip connections
U-Net style skip connections in the hybrid architecture.
parameters: null
BigramHash
Bigram hash feature used in the model.
parameters: null
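A hedged sketch of what a bigram-hash feature typically looks like: each (previous token, current token) pair is hashed into a fixed number of buckets that index a learned embedding table, giving the model a direct lookup for common two-token contexts. The bucket count and mixing constant below are illustrative assumptions, not the PR's actual values.

```python
# Hypothetical bigram-hash bucketing (constants are illustrative).
def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = 1 << 16) -> int:
    h = (prev_tok * 1000003) ^ tok   # cheap multiplicative mix of the pair
    return h % n_buckets             # index into a learned bucket table
```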
SmearGate
SmearGate component included in the architecture.
parameters: null
LeakyReLU
Squared LeakyReLU used as the MLP activation.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
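Muon's distinguishing step is orthogonalizing each momentum-accumulated matrix update via a Newton-Schulz iteration before applying it. A pure-Python sketch using the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X; the production optimizer uses a tuned quintic polynomial and runs on GPU, so this is only the idea, not the implementation.

```python
# List-based matrix helpers (illustration only; real Muon uses GPU tensors).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthogonalize(M, steps=20):
    # Scale by the Frobenius norm so all singular values are <= 1,
    # then apply the cubic Newton-Schulz map, which drives them to 1.
    fro = sum(v * v for row in M for v in row) ** 0.5
    X = [[v / fro for v in row] for row in M]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```

The result is (approximately) the nearest orthogonal matrix to the update's direction, which equalizes the scale of the update across singular directions.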
Quantization
GPTQ
bits: 6
scope: all
late QAT
bits: null
scope: all
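Both GPTQ and the late QAT phase target the same 6-bit grid. A hedged sketch of symmetric round-to-nearest quantization onto that grid: GPTQ goes further by quantizing weight columns one at a time and using the full per-layer inverse input Hessian to push each rounding error onto the not-yet-quantized columns, bookkeeping that is omitted here.

```python
# Symmetric fake-quantization to a 2^(bits-1)-1 grid (a simplification;
# not the PR's GPTQ code, which is Hessian-aware).
def fake_quant(ws, bits=6):
    qmax = 2 ** (bits - 1) - 1               # 31 levels each side for 6 bits
    scale = max(abs(w) for w in ws) / qmax   # per-group scale from max weight
    return [round(w / scale) * scale for w in ws]
```

In QAT this rounding is applied in the forward pass with a straight-through gradient, so the weights learn to sit near grid points before the final conversion.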
Compression
lzma
level: null
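The checkpoint is LZMA-compressed to fit under the 16MB artifact cap. A minimal sketch with Python's stdlib `lzma` module; the preset shown is an assumption, since the PR's exact settings are not specified above.

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # preset=9|PRESET_EXTREME trades compression time for ratio (assumed setting)
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def load_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```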
Evaluation
sliding window eval
parameters: {"stride":32}
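Sliding-window evaluation scores the sequence in overlapping windows: the context slides forward by a small stride (here 32) and only the last `stride` tokens of each window are scored, so nearly every token is predicted with close-to-full context. A sketch that just enumerates the window and scored-token index spans (in practice the first window would score all of its tokens; that special case is omitted for brevity):

```python
# Enumerate (window_start, window_end, score_start, score_end) index spans.
def sliding_windows(n_tokens, window=4096, stride=32):
    spans = []
    for start in range(0, n_tokens - window + 1, stride):
        end = start + window
        spans.append((start, end, end - stride, end))
    return spans
```

This is far more expensive than disjoint-chunk evaluation (one forward pass per 32 tokens instead of per 4096), but it removes the context cliff at chunk boundaries.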
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3500,"shape":"linear"}
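The linear warmdown holds the base learning rate until the final 3500 steps, then decays it linearly to zero; per the contributions below, it is this low-LR tail that lets late QAT activate properly. A sketch of the schedule (the total step count and base LR are placeholders, not values stated in the PR):

```python
# Constant LR, then linear decay to zero over the last `warmdown_steps`.
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```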
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- Full Hessian GPTQ replacing QAT-only quantization
- Late QAT triggered only in the final low-learning-rate training phase
- Linear warmdown schedule that enables late QAT to activate properly
- Mamba-3 hybrid architecture with one attention layer
- LZMA compression to fit the artifact under the 16MB limit
- Closing the quantization gap from 174 mBPB to effectively zero