PR #1607

open

Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB)

by inin-zou
val_bpb
1.4765
Architecture
Hybrid
Optimizer
Artifact Size
8.2MB

Training Techniques

Architecture
depth recurrence
Hinge-point multi-recurrence repeats the U-Net hinge layers to create more virtual layers with zero extra parameters.
parameters: {"layers":[3,4],"repeats":2,"virtual_layers":12,"physical_layers":8}
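One plausible reading of these parameters as an execution schedule (the replay order is an assumption; only the layer indices, repeat count, and 8-to-12 layer math come from the submission):

```python
def hinge_schedule(n_layers=8, hinge=(3, 4), repeats=2):
    """Execution order of physical block indices under hinge recurrence.

    The hinge blocks are replayed `repeats` extra times after their first
    pass, turning 8 physical layers into 12 virtual ones with zero new
    parameters.
    """
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == hinge[-1]:                       # after the last hinge layer
            order.extend(list(hinge) * repeats)  # replay the hinge span
    return order

print(hinge_schedule())  # [0, 1, 2, 3, 4, 3, 4, 3, 4, 5, 6, 7] -> 12 virtual layers
```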
U-Net skip connections
Uses a U-Net-style encoder-decoder hybrid with skip connections.
parameters: null
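A minimal sketch of the skip-connection wiring, assuming an additive combine rule (the submission does not specify add vs. concat-and-project):

```python
def unet_forward(blocks, x):
    """U-Net style stack: each encoder block's output is saved and added
    back into the input of its mirrored decoder block.

    Illustrative only; the actual combine rule is an assumption.
    """
    n = len(blocks)
    skips = []
    for i, block in enumerate(blocks):
        if i < n // 2:
            x = block(x)
            skips.append(x)                # save encoder activation
        else:
            x = block(x + skips.pop())     # decoder consumes matching skip
    return x
```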
GQA
The attention layer uses grouped-query attention.
parameters: {"heads":8,"kv_heads":4}
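A minimal numpy sketch of the 8:4 head grouping, where each KV head serves two query heads (causal mask omitted for brevity; shapes are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2:1).

    Shapes: q is (T, heads, d); k and v are (T, kv_heads, d).
    """
    group = heads // kv_heads                  # query heads per KV head
    k = np.repeat(k, group, axis=1)            # broadcast KV heads 4 -> 8
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over key positions
    return np.einsum("hts,shd->thd", weights, v)
```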
RoPE
Attention uses RoPE positional encoding with base 10000.
parameters: {"base":10000}
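A sketch of rotary embedding with base 10000 (the half-split pairing convention below is one common choice; the submission's exact variant is an assumption):

```python
import numpy as np

def rope(x, base=10000):
    """Apply rotary position embedding to x of shape (T, d).

    Each feature pair is rotated by a position-dependent angle whose
    frequency decays geometrically with base 10000.
    """
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # (d/2,) per-pair frequencies
    angles = np.outer(np.arange(T), freqs)      # (T, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Position 0 is left unchanged and every position keeps its norm, since each pair is simply rotated.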
depth recurrence
Recurrence is enabled only partway through training, letting the model converge initially without it.
parameters: {"start_frac":0.35,"start_step":350}
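A hypothetical gating helper for the delayed start; the exact rule combining the fraction and the absolute step is an assumption (only the two parameter values come from the submission):

```python
def recurrence_active(step, total_steps, start_frac=0.35, start_step=350):
    """Return True once depth recurrence should be applied.

    Sketch: gate on at 35% of training or step 350, whichever is later.
    """
    return step >= max(start_step, int(start_frac * total_steps))
```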
Quantization
GPTQ
bits: 6
scope: all
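For flavor, a 6-bit symmetric round-to-nearest quantizer is sketched below. Note this is not GPTQ proper, which additionally compensates rounding error column-by-column using second-order (Hessian) information; the sketch only shows what a 6-bit grid looks like:

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric per-tensor round-to-nearest quantization to 6 bits.

    Illustrative stand-in for GPTQ: same bit-width and grid, but without
    GPTQ's Hessian-based error compensation.
    """
    qmax = 2 ** (bits - 1) - 1                     # 6 bits -> [-32, 31]
    scale = np.abs(w).max() / qmax                 # map extremes onto the grid
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.random.default_rng(2).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_rtn(w)
w_hat = q * s   # dequantized approximation of w
```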
Compression
lzma
level: 9
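LZMA at level 9 corresponds to Python's `lzma.compress` with `preset=9`; the checkpoint bytes below are a stand-in for the packed weights:

```python
import lzma

raw = b"\x00" * 1_000_000             # stand-in for the packed weight bytes
packed = lzma.compress(raw, preset=9)  # maximum compression preset
restored = lzma.decompress(packed)
assert restored == raw                 # round-trip is lossless
```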
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Hybrid architecture combining Mamba-3 SSM layers with one attention layer in a Nemotron-H inspired layout.
parameters: {"mamba_layers":7,"attention_layers":1}
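A sketch of the layer layout; the submission states only the 7:1 Mamba-to-attention ratio, so the attention block's position (index 4 here) is a guess:

```python
def build_layout(mamba_layers=7, attention_layers=1, attn_pos=4):
    """Nemotron-H style hybrid: mostly SSM blocks with a single attention
    block interleaved. attn_pos is hypothetical."""
    layout = ["mamba"] * (mamba_layers + attention_layers)
    layout[attn_pos] = "attention"
    return layout

print(build_layout())  # 8 blocks, one attention among seven mamba
```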

Novel Contributions

  • First Mamba depth recurrence in the competition
  • First hinge-point multi-recurrence in the competition
  • Nemotron-H inspired hybrid of Mamba-3 and attention
  • Focused hinge-point recurrence outperforms spread recurrence
  • Systematic ablation of recurrence, quantization, and architectural variants
  • Demonstration that standard GPTQ handles SSM outliers well at this scale