PR #2070

open

Add hybrid nGPT-GPT-Mamba submission

by vardanbobo007
val_bpb: 1.1730
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 16MB

Training Techniques

Architecture
Hybrid
Hybrid model mixing modified nGPT layers, standard GPT transformer layers, and Mamba2 layers.
parameters: {"layers":["nT","M","M","T","M","M","T","M","T","nT"]}
LeakyReLU
The GPT-style MLP uses a squared LeakyReLU activation.
parameters: null
ReLU²
The MLP activation is a squared nonlinearity.
parameters: null
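Reading the two activation entries together, the MLP nonlinearity is squared, i.e. relu(x)^2 or, in the GPT-style blocks, leaky_relu(x)^2. Below is a minimal sketch of such an MLP; the 4x expansion and module layout are assumptions, not taken from the submission.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SquaredActMLP(nn.Module):
        # GPT-style MLP with a squared activation (assumed 4x expansion).
        def __init__(self, d_model, leaky=True):
            super().__init__()
            self.fc_in = nn.Linear(d_model, 4 * d_model)
            self.fc_out = nn.Linear(4 * d_model, d_model)
            self.leaky = leaky

        def forward(self, x):
            h = F.leaky_relu(self.fc_in(x)) if self.leaky else F.relu(self.fc_in(x))
            return self.fc_out(h * h)  # squared nonlinearity: act(x)**2
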
Mamba
Uses standard Mamba2 layers within the hybrid stack.
parameters: {"d_state":128,"d_conv":4,"expand":2,"head_dim":64}
attention modifications
Modified nGPT layers normalize only Q and K after optimizer steps; attention and MLP both receive the layer input directly; final nGPT output is h_att + h_mlp - h.
parameters: null
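A minimal sketch of the described residual pattern; attn and mlp stand for the layer's attention and MLP branches, and whether each branch output already contains the input stream is an assumption. This only illustrates the h_att + h_mlp - h combination, not the full nGPT normalization scheme.

    import torch

    def modified_ngpt_forward(h, attn, mlp):
        # Both branches receive the layer input h directly.
        h_att = attn(h)   # assumption: branch output already includes the residual stream
        h_mlp = mlp(h)
        # Adding the branches and subtracting h keeps a single copy of the
        # input stream when each branch returns roughly h + its own update.
        return h_att + h_mlp - h
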
Optimizer
Muon
weight_decay: 0.75
momentum: null
other_params: {"adamw":true}
Evaluation
sliding window eval
parameters: null
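No parameters are listed for the sliding-window evaluation, so the stride and scoring rule below are assumptions; the sketch only shows the usual pattern of scoring each token once while keeping a long left context.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sliding_window_nll(model, tokens, window=8192, stride=4096):
        # tokens: 1-D LongTensor; model is assumed to return (1, L, vocab) logits.
        total, count = 0.0, 0
        for start in range(0, tokens.numel() - 1, stride):
            end = min(start + window, tokens.numel())
            chunk = tokens[start:end].unsqueeze(0)
            logits = model(chunk)
            nll = F.cross_entropy(logits[0, :-1], chunk[0, 1:], reduction="none")
            # Score only targets not covered by the previous window.
            new = nll if start == 0 else nll[window - stride - 1:]
            total += new.sum().item()
            count += new.numel()
            if end == tokens.numel():
                break
        return total / count   # mean NLL in nats; bits = nats / ln 2 for bpb
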
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Regularization
weight decay
parameters: {"weight_decay":0.75}

Novel Contributions

  • Hybrid nGPT/GPT/Mamba architecture
  • nGPT layers placed at the beginning and end for stability
  • Modified nGPT layer design in which attention and MLP both receive the layer input directly, with output h_att + h_mlp - h
  • Only Q and K normalization after optimizer steps
  • Sliding-window evaluation
  • Use of Muon with AdamW and high weight decay