PR #2073

open

Hybrid nGPT / GPT / Mamba submission

by vardanbobo007
val_bpb
1.1726
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15,532,451 bytes

Training Techniques

Architecture
Hybrid
Mixes nGPT transformer layers, standard GPT transformer layers, and Mamba2 layers in a single model.
parameters: {"layers":["nT","M","M","T","M","M","T","M","T","nT"]}
LeakyReLU
The GPT-style MLP uses a LeakyReLU(x)^2 activation.
parameters: null
ReLU²
The MLP activation is a squared nonlinearity of the form LeakyReLU(x)^2.
parameters: null
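A minimal sketch of an MLP with this activation; the hidden-width multiplier and the LeakyReLU slope are assumptions (the PR does not list them).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredLeakyReLUMLP(nn.Module):
    """GPT-style MLP with a LeakyReLU(x)^2 (ReLU²-style squared) activation."""
    def __init__(self, d_model, hidden_mult=4, negative_slope=0.01):
        super().__init__()
        self.fc_in = nn.Linear(d_model, hidden_mult * d_model)
        self.fc_out = nn.Linear(hidden_mult * d_model, d_model)
        self.negative_slope = negative_slope  # slope is an assumption; PyTorch's default is 0.01

    def forward(self, x):
        h = F.leaky_relu(self.fc_in(x), negative_slope=self.negative_slope)
        return self.fc_out(h * h)  # square the activation: LeakyReLU(x)^2

x = torch.randn(2, 8, 64)
y = SquaredLeakyReLUMLP(64)(x)  # -> shape (2, 8, 64)
```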
Mamba
Standard Mamba2 layers used as part of the hybrid stack.
parameters: {"d_state":128,"d_conv":4,"expand":2,"head_dim":64}
attention modification
In both nGPT and GPT-style layers, the MLP receives the layer input directly rather than the attention output.
parameters: null
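To make the wiring concrete, a minimal pre-norm sketch of a GPT-style block where both branches read the block input h; the attention/MLP modules, the pre-norm placement, and the final residual sum are assumptions rather than the submission's code.

```python
import torch.nn as nn

class ParallelGPTBlock(nn.Module):
    """GPT-style block where the MLP reads the layer input h directly,
    instead of the post-attention residual stream."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.attn, self.mlp = attn, mlp

    def forward(self, h):
        h_att = self.attn(self.ln_attn(h))  # attention branch reads h
        h_mlp = self.mlp(self.ln_mlp(h))    # MLP branch also reads h, not h + h_att
        return h + h_att + h_mlp            # residual sum; exact combination is assumed
```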
attention modification
Modified nGPT layer output combines attention and MLP outputs as h_att + h_mlp - h.
parameters: null
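A corresponding sketch of the modified nGPT combination; the re-normalization onto the unit hypersphere follows the nGPT convention and is an assumption beyond what the metadata states.

```python
import torch
import torch.nn.functional as F

def modified_ngpt_combine(h, h_att, h_mlp):
    """Modified nGPT layer output: out = h_att + h_mlp - h, where both branch
    outputs were computed from h directly. The final unit-norm projection is
    assumed (nGPT-style), not stated in the PR metadata."""
    return F.normalize(h_att + h_mlp - h, dim=-1)

h = F.normalize(torch.randn(2, 8, 64), dim=-1)
out = modified_ngpt_combine(h, h.clone(), h.clone())  # dummy branch outputs, shape (2, 8, 64)
```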
Optimizer
Muon
weight_decay: 0.75
momentum: null
other_params: {"adamw":true}
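One common reading of other_params {"adamw": true} is that Muon handles the 2-D weight matrices while AdamW covers embeddings, norms, and other non-matrix parameters. A hedged sketch of that split follows; the Muon class is hypothetical (e.g. a modded-nanogpt-style implementation that accepts weight decay), the weight decay of 0.75 is from the submission, and the learning rates and parameter split are assumptions.

```python
import torch

def build_optimizers(model, muon_cls, muon_lr=0.02, adamw_lr=3e-4, weight_decay=0.75):
    """Split parameters: 2-D weight matrices go to Muon, the rest to AdamW.
    `muon_cls` stands in for a Muon implementation that supports weight decay."""
    muon_params = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    adamw_params = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    muon_opt = muon_cls(muon_params, lr=muon_lr, weight_decay=weight_decay)
    adamw_opt = torch.optim.AdamW(adamw_params, lr=adamw_lr, weight_decay=weight_decay)
    return muon_opt, adamw_opt
```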
Evaluation
sliding window eval
parameters: null
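A sketch of sliding-window evaluation for bits-per-byte: long text is scored in overlapping windows so later tokens keep (up to) a full window of left context, and only the part of each window not covered by the previous one contributes to the loss. The window size matches the listed eval_length of 8192; the stride and the model call signature are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=8192, stride=4096):
    """Total negative log-likelihood over a 1-D LongTensor of tokens, scored in
    overlapping windows; each step only scores positions not covered before."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, tokens.numel() - 1, stride):
        chunk = tokens[start:start + window].unsqueeze(0)    # (1, <= window)
        logits = model(chunk)                                # assumed to return (1, T, vocab)
        new_from = 0 if start == 0 else window - stride - 1  # first logit position to score
        if chunk.size(1) <= new_from + 1:
            break
        targets = chunk[0, new_from + 1:]
        total_nll += F.cross_entropy(logits[0, new_from:-1], targets, reduction="sum").item()
        n_scored += targets.numel()
        if start + window >= tokens.numel():
            break
    return total_nll, n_scored  # bpb = total_nll / (ln(2) * num_bytes_in_text)
```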
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Regularization
weight decay
parameters: {"value":0.75}

Novel Contributions

  • Hybrid stack combining nGPT, GPT, and Mamba2 layers
  • Placing modified nGPT layers at the beginning and end for stability
  • Modified nGPT formulation with direct MLP input and h_att + h_mlp - h output
  • Selective Q/K normalization after optimizer steps (see the sketch after this list)
  • Sliding-window evaluation
  • Use of Muon with AdamW and strong weight decay
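
The Q/K normalization contribution is not detailed in the technique list above; the sketch below shows what "selective Q/K normalization after optimizer steps" might look like, in the nGPT spirit of re-projecting weight rows onto the unit sphere but applied only to query/key projections. The parameter naming ("q_proj"/"k_proj") and the row-wise normalization are assumptions.

```python
import torch

@torch.no_grad()
def renormalize_qk_weights(model):
    """After each optimizer step, rescale query/key projection rows to unit norm
    (nGPT-style weight normalization, applied selectively to Q/K only)."""
    for name, param in model.named_parameters():
        if param.ndim == 2 and ("q_proj" in name or "k_proj" in name):
            param.copy_(param / param.norm(dim=-1, keepdim=True).clamp_min(1e-8))

# Typical training-loop placement (sketch):
# loss.backward(); optimizer.step(); optimizer.zero_grad(); renormalize_qk_weights(model)
```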