PR #1607

open

Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB)

by inin-zou
val_bpb
1.4765
Architecture
Hybrid
Optimizer
Artifact Size
8.2MB

Training Techniques

Architecture
depth recurrence
Hinge-point multi-recurrence repeats the U-Net hinge layers to create more virtual layers with zero extra parameters.
parameters: {"layers":[3,4],"repeats":2,"virtual_layers":12,"physical_layers":8}
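One plausible reading of these parameters as an execution schedule (the replay order is an assumption; only the layer indices, repeat count, and 8-to-12 layer math come from the submission):

```python
def hinge_schedule(n_layers=8, hinge=(3, 4), repeats=2):
    """Execution order of physical block indices under hinge recurrence.

    The hinge blocks are replayed `repeats` extra times after their first
    pass, turning 8 physical layers into 12 virtual ones with zero new
    parameters.
    """
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == hinge[-1]:                       # after the last hinge layer
            order.extend(list(hinge) * repeats)  # replay the hinge span
    return order

print(hinge_schedule())  # [0, 1, 2, 3, 4, 3, 4, 3, 4, 5, 6, 7] -> 12 virtual layers
```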
U-Net skip connections
Uses a U-Net-style encoder-decoder hybrid with skip connections.
parameters: null
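A minimal sketch of the skip-connection wiring, assuming an additive combine rule (the submission does not specify add vs. concat-and-project):

```python
def unet_forward(blocks, x):
    """U-Net style stack: each encoder block's output is saved and added
    back into the input of its mirrored decoder block.

    Illustrative only; the actual combine rule is an assumption.
    """
    n = len(blocks)
    skips = []
    for i, block in enumerate(blocks):
        if i < n // 2:
            x = block(x)
            skips.append(x)                # save encoder activation
        else:
            x = block(x + skips.pop())     # decoder consumes matching skip
    return x
```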
GQA
The attention layer uses grouped-query attention.
parameters: {"heads":8,"kv_heads":4}
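A minimal numpy sketch of the 8:4 head grouping, where each KV head serves two query heads (causal mask omitted for brevity; shapes are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2:1).

    Shapes: q is (T, heads, d); k and v are (T, kv_heads, d).
    """
    group = heads // kv_heads                  # query heads per KV head
    k = np.repeat(k, group, axis=1)            # broadcast KV heads 4 -> 8
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over key positions
    return np.einsum("hts,shd->thd", weights, v)
```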
RoPE
Attention uses RoPE positional encoding with base 10000.
parameters: {"base":10000}
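A sketch of rotary embedding with base 10000 (the half-split pairing convention below is one common choice; the submission's exact variant is an assumption):

```python
import numpy as np

def rope(x, base=10000):
    """Apply rotary position embedding to x of shape (T, d).

    Each feature pair is rotated by a position-dependent angle whose
    frequency decays geometrically with base 10000.
    """
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # (d/2,) per-pair frequencies
    angles = np.outer(np.arange(T), freqs)      # (T, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Position 0 is left unchanged and every position keeps its norm, since each pair is simply rotated.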
depth recurrence
Recurrence is enabled only partway through training, letting the model converge initially without it.
parameters: {"start_frac":0.35,"start_step":350}
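A hypothetical gating helper for the delayed start; the exact rule combining the fraction and the absolute step is an assumption (only the two parameter values come from the submission):

```python
def recurrence_active(step, total_steps, start_frac=0.35, start_step=350):
    """Return True once depth recurrence should be applied.

    Sketch: gate on at 35% of training or step 350, whichever is later.
    """
    return step >= max(start_step, int(start_frac * total_steps))
```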
Quantization
GPTQ
bits: 6
scope: all
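For flavor, a 6-bit symmetric round-to-nearest quantizer is sketched below. Note this is not GPTQ proper, which additionally compensates rounding error column-by-column using second-order (Hessian) information; the sketch only shows what a 6-bit grid looks like:

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric per-tensor round-to-nearest quantization to 6 bits.

    Illustrative stand-in for GPTQ: same bit-width and grid, but without
    GPTQ's Hessian-based error compensation.
    """
    qmax = 2 ** (bits - 1) - 1                     # 6 bits -> [-32, 31]
    scale = np.abs(w).max() / qmax                 # map extremes onto the grid
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.random.default_rng(2).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_rtn(w)
w_hat = q * s   # dequantized approximation of w
```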
Compression
lzma
level: 9
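LZMA at level 9 corresponds to Python's `lzma.compress` with `preset=9`; the checkpoint bytes below are a stand-in for the packed weights:

```python
import lzma

raw = b"\x00" * 1_000_000             # stand-in for the packed weight bytes
packed = lzma.compress(raw, preset=9)  # maximum compression preset
restored = lzma.decompress(packed)
assert restored == raw                 # round-trip is lossless
```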
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Hybrid architecture combining Mamba-3 SSM layers with one attention layer in a Nemotron-H inspired layout.
parameters: {"mamba_layers":7,"attention_layers":1}
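A sketch of the layer layout; the submission states only the 7:1 Mamba-to-attention ratio, so the attention block's position (index 4 here) is a guess:

```python
def build_layout(mamba_layers=7, attention_layers=1, attn_pos=4):
    """Nemotron-H style hybrid: mostly SSM blocks with a single attention
    block interleaved. attn_pos is hypothetical."""
    layout = ["mamba"] * (mamba_layers + attention_layers)
    layout[attn_pos] = "attention"
    return layout

print(build_layout())  # 8 blocks, one attention among seven mamba
```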

Novel Contributions

  • First Mamba depth recurrence in the competition
  • First hinge-point multi-recurrence in the competition
  • Nemotron-H inspired hybrid of Mamba-3 and attention
  • Focused hinge-point recurrence outperforms spread recurrence
  • Systematic ablation of recurrence, quantization, and architectural variants
  • Demonstration that standard GPTQ handles SSM outliers well at this scale