PR #1607
Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB)
Status: open
by inin-zou
val_bpb: 1.4765
Architecture: Hybrid
Optimizer: —
Artifact Size: 8.2 MB
Training Techniques
Architecture
depth recurrence
Hinge-point multi-recurrence repeats the U-Net hinge layers to create more virtual layers with zero extra parameters.
parameters: {"layers":[3,4],"repeats":2,"virtual_layers":12,"physical_layers":8}
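The hinge-point mechanism can be sketched as follows. This is a minimal illustration, not the submission's code: it assumes `repeats` means extra passes per hinge layer, which is consistent with the reported numbers (8 physical layers + 2 hinge layers × 2 extra passes = 12 virtual layers).

```python
# Hinge-point multi-recurrence sketch, assuming the reported config:
# 8 physical layers, hinge layers [3, 4], repeats=2 extra passes each,
# yielding 8 + 2*2 = 12 virtual layers with zero extra parameters.
def forward(x, layers, hinge_ids=(3, 4), repeats=2):
    for i, layer in enumerate(layers):
        x = layer(x)                    # every physical layer runs once
        if i in hinge_ids:
            for _ in range(repeats):
                x = layer(x)            # extra passes reuse the same weights
    return x

# Toy demo: counting layers stand in for real blocks.
calls = []
def make_layer(i):
    def layer(x):
        calls.append(i)
        return x + 1
    return layer

layers = [make_layer(i) for i in range(8)]
out = forward(0, layers)                # 12 layer applications in total
```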
U-Net skip connections
Uses a U-Net encoder-decoder style hybrid with skip connections.
parameters: null
GQA
Attention layer uses grouped query attention.
parameters: {"heads":8,"kv_heads":4}
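With the reported `heads=8`, `kv_heads=4`, the GQA grouping rule is that each KV head serves `heads // kv_heads = 2` query heads. A minimal sketch of that mapping:

```python
# GQA head grouping with the reported heads=8, kv_heads=4:
# query head q attends using KV head q // (heads // kv_heads).
def kv_head_for(q_head, heads=8, kv_heads=4):
    group_size = heads // kv_heads      # 2 query heads per KV head
    return q_head // group_size

mapping = [kv_head_for(q) for q in range(8)]
```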
RoPE
Attention uses RoPE positional encoding with base 10000.
parameters: {"base":10000}
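The base-10000 setting fixes the standard RoPE frequency spectrum; `head_dim` below is an assumed value, not stated in the PR.

```python
# Standard RoPE inverse frequencies with base 10000.
# head_dim=64 is an assumption for illustration only.
def rope_inv_freq(head_dim=64, base=10000.0):
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

freqs = rope_inv_freq()
```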
depth recurrence
Recurrence is enabled partway through training, so the model first converges without it before the extra passes are applied.
parameters: {"start_frac":0.35,"start_step":350}
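The two reported parameters are consistent if `start_step = round(start_frac * total_steps)` with roughly 1000 total steps; that relationship is an assumption. A minimal gate:

```python
# Hypothetical recurrence schedule gate. With the reported
# start_frac=0.35 and ~1000 total steps (assumed), recurrence
# switches on at step 350, matching the reported start_step.
def recurrence_start_step(total_steps, start_frac=0.35):
    return round(start_frac * total_steps)

def recurrence_active(step, total_steps=1000, start_frac=0.35):
    return step >= recurrence_start_step(total_steps, start_frac)
```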
Quantization
GPTQ
bits: 6
scope: all
Compression
lzma
level: 9
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Hybrid architecture combining Mamba-3 SSM layers with one attention layer in a Nemotron-H inspired layout.
parameters: {"mamba_layers":7,"attention_layers":1}
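The 7-Mamba / 1-attention stack can be sketched as a layout list. Where the single attention layer sits is not stated in the PR; mid-stack placement below is purely an assumption.

```python
# Sketch of the Nemotron-H style hybrid stack: 7 Mamba-3 blocks
# plus one GQA attention block. The attention position (mid-stack)
# is an assumption, not taken from the PR.
def build_layout(mamba_layers=7, attention_layers=1):
    n = mamba_layers + attention_layers
    attn_positions = {n // 2}           # assumed placement
    return ["attention" if i in attn_positions else "mamba"
            for i in range(n)]

layout = build_layout()
```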
Novel Contributions
- First Mamba depth recurrence in the competition
- First hinge-point multi-recurrence in the competition
- Nemotron-H inspired hybrid of Mamba-3 and attention
- Focused hinge-point recurrence outperforms spread recurrence
- Systematic ablation of recurrence, quantization, and architectural variants
- Demonstration that standard GPTQ handles SSM outliers well at this scale