PR #2070

open

Add hybrid nGPT-GPT-Mamba submission

by vardanbobo007
val_bpb: 1.1730
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 16MB

Training Techniques

Architecture
Hybrid
Hybrid model mixing modified nGPT layers, standard GPT transformer layers, and Mamba2 layers.
parameters: {"layers":["nT","M","M","T","M","M","T","M","T","nT"]}
LeakyReLU
The GPT-style MLP uses a squared LeakyReLU activation.
parameters: null
ReLU²
The MLP activation is a squared nonlinearity.
parameters: null
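Reading the two activation entries together, the MLP nonlinearity is squared, i.e. relu(x)^2 or, in the GPT-style blocks, leaky_relu(x)^2. Below is a minimal sketch of such an MLP; the 4x expansion and module layout are assumptions, not taken from the submission.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SquaredActMLP(nn.Module):
        # GPT-style MLP with a squared activation (assumed 4x expansion).
        def __init__(self, d_model, leaky=True):
            super().__init__()
            self.fc_in = nn.Linear(d_model, 4 * d_model)
            self.fc_out = nn.Linear(4 * d_model, d_model)
            self.leaky = leaky

        def forward(self, x):
            h = F.leaky_relu(self.fc_in(x)) if self.leaky else F.relu(self.fc_in(x))
            return self.fc_out(h * h)  # squared nonlinearity: act(x)**2
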
Mamba
Uses standard Mamba2 layers within the hybrid stack.
parameters: {"d_state":128,"d_conv":4,"expand":2,"head_dim":64}
attention modifications
Modified nGPT layers normalize only Q and K after optimizer steps; attention and MLP both receive the layer input directly; final nGPT output is h_att + h_mlp - h.
parameters: null
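A minimal sketch of the described residual pattern; attn and mlp stand for the layer's attention and MLP branches, and whether each branch output already contains the input stream is an assumption. This only illustrates the h_att + h_mlp - h combination, not the full nGPT normalization scheme.

    import torch

    def modified_ngpt_forward(h, attn, mlp):
        # Both branches receive the layer input h directly.
        h_att = attn(h)   # assumption: branch output already includes the residual stream
        h_mlp = mlp(h)
        # Adding the branches and subtracting h keeps a single copy of the
        # input stream when each branch returns roughly h + its own update.
        return h_att + h_mlp - h
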
Optimizer
Muon
weight_decay: 0.75
momentum: null
other_params: {"adamw":true}
Evaluation
sliding window eval
parameters: null
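No parameters are listed for the sliding-window evaluation, so the stride and scoring rule below are assumptions; the sketch only shows the usual pattern of scoring each token once while keeping a long left context.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sliding_window_nll(model, tokens, window=8192, stride=4096):
        # tokens: 1-D LongTensor; model is assumed to return (1, L, vocab) logits.
        total, count = 0.0, 0
        for start in range(0, tokens.numel() - 1, stride):
            end = min(start + window, tokens.numel())
            chunk = tokens[start:end].unsqueeze(0)
            logits = model(chunk)
            nll = F.cross_entropy(logits[0, :-1], chunk[0, 1:], reduction="none")
            # Score only targets not covered by the previous window.
            new = nll if start == 0 else nll[window - stride - 1:]
            total += new.sum().item()
            count += new.numel()
            if end == tokens.numel():
                break
        return total / count   # mean NLL in nats; bits = nats / ln 2 for bpb
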
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Regularization
weight decay
parameters: {"weight_decay":0.75}

Novel Contributions

  • Hybrid nGPT/GPT/Mamba architecture
  • nGPT layers placed at the beginning and end for stability
  • Modified nGPT layer design in which attention and MLP both receive the layer input directly, with output h_att + h_mlp - h
  • Only Q and K normalization after optimizer steps
  • Sliding-window evaluation
  • Use of Muon with AdamW and high weight decay