PR #2073

open

Hybrid nGPT / GPT / Mamba submission

by vardanbobo007
val_bpb
1.1726
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15,532,451 bytes

Training Techniques

Architecture
Hybrid
Mixes nGPT transformer layers, standard GPT transformer layers, and Mamba2 layers in a single model.
parameters: {"layers":["nT","M","M","T","M","M","T","M","T","nT"]}
LeakyReLU
The GPT-style MLP uses a LeakyReLU(x)^2 activation.
parameters: null
ReLU²
The MLP activation is a squared nonlinearity of the form LeakyReLU(x)^2.
parameters: null
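A minimal sketch of an MLP with this activation; the hidden-width multiplier and the LeakyReLU slope are assumptions (the PR does not list them).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredLeakyReLUMLP(nn.Module):
    """GPT-style MLP with a LeakyReLU(x)^2 (ReLU²-style squared) activation."""
    def __init__(self, d_model, hidden_mult=4, negative_slope=0.01):
        super().__init__()
        self.fc_in = nn.Linear(d_model, hidden_mult * d_model)
        self.fc_out = nn.Linear(hidden_mult * d_model, d_model)
        self.negative_slope = negative_slope  # slope is an assumption; PyTorch's default is 0.01

    def forward(self, x):
        h = F.leaky_relu(self.fc_in(x), negative_slope=self.negative_slope)
        return self.fc_out(h * h)  # square the activation: LeakyReLU(x)^2

x = torch.randn(2, 8, 64)
y = SquaredLeakyReLUMLP(64)(x)  # -> shape (2, 8, 64)
```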
Mamba
Standard Mamba2 layers used as part of the hybrid stack.
parameters: {"d_state":128,"d_conv":4,"expand":2,"head_dim":64}
attention modification
In both nGPT and GPT-style layers, the MLP receives the layer input directly rather than the attention output.
parameters: null
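To make the wiring concrete, a minimal pre-norm sketch of a GPT-style block where both branches read the block input h; the attention/MLP modules, the pre-norm placement, and the final residual sum are assumptions rather than the submission's code.

```python
import torch.nn as nn

class ParallelGPTBlock(nn.Module):
    """GPT-style block where the MLP reads the layer input h directly,
    instead of the post-attention residual stream."""
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.attn, self.mlp = attn, mlp

    def forward(self, h):
        h_att = self.attn(self.ln_attn(h))  # attention branch reads h
        h_mlp = self.mlp(self.ln_mlp(h))    # MLP branch also reads h, not h + h_att
        return h + h_att + h_mlp            # residual sum; exact combination is assumed
```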
attention modification
Modified nGPT layer output combines attention and MLP outputs as h_att + h_mlp - h.
parameters: null
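A corresponding sketch of the modified nGPT combination; the re-normalization onto the unit hypersphere follows the nGPT convention and is an assumption beyond what the metadata states.

```python
import torch
import torch.nn.functional as F

def modified_ngpt_combine(h, h_att, h_mlp):
    """Modified nGPT layer output: out = h_att + h_mlp - h, where both branch
    outputs were computed from h directly. The final unit-norm projection is
    assumed (nGPT-style), not stated in the PR metadata."""
    return F.normalize(h_att + h_mlp - h, dim=-1)

h = F.normalize(torch.randn(2, 8, 64), dim=-1)
out = modified_ngpt_combine(h, h.clone(), h.clone())  # dummy branch outputs, shape (2, 8, 64)
```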
Optimizer
Muon
weight_decay: 0.75
momentum: null
other_params: {"adamw":true}
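One common reading of other_params {"adamw": true} is that Muon handles the 2-D weight matrices while AdamW covers embeddings, norms, and other non-matrix parameters. A hedged sketch of that split follows; the Muon class is hypothetical (e.g. a modded-nanogpt-style implementation that accepts weight decay), the weight decay of 0.75 is from the submission, and the learning rates and parameter split are assumptions.

```python
import torch

def build_optimizers(model, muon_cls, muon_lr=0.02, adamw_lr=3e-4, weight_decay=0.75):
    """Split parameters: 2-D weight matrices go to Muon, the rest to AdamW.
    `muon_cls` stands in for a Muon implementation that supports weight decay."""
    muon_params = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    adamw_params = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    muon_opt = muon_cls(muon_params, lr=muon_lr, weight_decay=weight_decay)
    adamw_opt = torch.optim.AdamW(adamw_params, lr=adamw_lr, weight_decay=weight_decay)
    return muon_opt, adamw_opt
```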
Evaluation
sliding window eval
parameters: null
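A sketch of sliding-window evaluation for bits-per-byte: long text is scored in overlapping windows so later tokens keep (up to) a full window of left context, and only the part of each window not covered by the previous one contributes to the loss. The window size matches the listed eval_length of 8192; the stride and the model call signature are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=8192, stride=4096):
    """Total negative log-likelihood over a 1-D LongTensor of tokens, scored in
    overlapping windows; each step only scores positions not covered before."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, tokens.numel() - 1, stride):
        chunk = tokens[start:start + window].unsqueeze(0)    # (1, <= window)
        logits = model(chunk)                                # assumed to return (1, T, vocab)
        new_from = 0 if start == 0 else window - stride - 1  # first logit position to score
        if chunk.size(1) <= new_from + 1:
            break
        targets = chunk[0, new_from + 1:]
        total_nll += F.cross_entropy(logits[0, new_from:-1], targets, reduction="sum").item()
        n_scored += targets.numel()
        if start + window >= tokens.numel():
            break
    return total_nll, n_scored  # bpb = total_nll / (ln(2) * num_bytes_in_text)
```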
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Regularization
weight decay
parameters: {"value":0.75}

Novel Contributions

  • Hybrid stack combining nGPT, GPT, and Mamba2 layers
  • Placing modified nGPT layers at the beginning and end for stability
  • Modified nGPT formulation with direct MLP input and h_att + h_mlp - h output
  • Selective Q/K normalization after optimizer steps (see the sketch after this list)
  • Sliding-window evaluation
  • Use of Muon with AdamW and strong weight decay
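
The Q/K normalization contribution is not detailed in the technique list above; the sketch below shows what "selective Q/K normalization after optimizer steps" might look like, in the nGPT spirit of re-projecting weight rows onto the unit sphere but applied only to query/key projections. The parameter naming ("q_proj"/"k_proj") and the row-wise normalization are assumptions.

```python
import torch

@torch.no_grad()
def renormalize_qk_weights(model):
    """After each optimizer step, rescale query/key projection rows to unit norm
    (nGPT-style weight normalization, applied selectively to Q/K only)."""
    for name, param in model.named_parameters():
        if param.ndim == 2 and ("q_proj" in name or "k_proj" in name):
            param.copy_(param / param.norm(dim=-1, keepdim=True).clamp_min(1e-8))

# Typical training-loop placement (sketch):
# loss.backward(); optimizer.step(); optimizer.zero_grad(); renormalize_qk_weights(model)
```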