PR #970

open

Non-record: GatedDeltaNet SSM via fla library — 1.2907 bpb, 15.79MB

val_bpb: 1.2907
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.79 MB

Training Techniques

Architecture
GatedDeltaNet
Replaces attention with a selective state space model using delta-rule recurrence and fused Triton kernels.
parameters: {"layers":12,"dimensions":384,"head_dim":64,"heads_per_layer":6,"chunk_size":64}
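The PR relies on fla's fused chunk-parallel Triton kernels, but the underlying recurrence can be sketched as a naive per-token reference. This is a minimal sketch of the textbook gated delta rule (decay the state, then write the new key→value association); shapes and gate names are illustrative, not the fla API:

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive per-token reference of the gated delta-rule recurrence.

    q, k: (T, d_k) queries/keys; v: (T, d_v) values;
    alpha: (T,) forget gates in (0, 1); beta: (T,) write strengths.
    The fla Triton kernels compute the same recurrence chunk-parallel
    (chunk size 64 in this PR) rather than token by token.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # state: a fast-weight k -> v map
    out = np.empty((T, d_v))
    for t in range(T):
        S = alpha[t] * S                                  # decay old state
        S = S + beta[t] * np.outer(k[t], v[t] - S.T @ k[t])  # delta-rule write
        out[t] = S.T @ q[t]                               # read with the query
    return out
```

With orthogonal keys and no decay, querying with an earlier key retrieves the value stored under it, which is the associative-memory behavior the delta rule is built for.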
U-Net skip connections
Adds learned U-Net-style skip connections from early layers to their mirrored late layers in the stack.
parameters: {"layers":12}
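The U-Net wiring can be sketched as a push/pop over the layer stack: the first half of the layers save their inputs, the second half add them back through a learned scalar. A minimal sketch; the gain parameters and block interface are illustrative, not the PR's actual module:

```python
class UNetStack:
    """Hedged sketch of learned U-Net skips over a 12-layer stack:
    encoder-half activations are saved and added back into the
    mirrored decoder-half layers via per-skip scalars (plain floats
    here; trainable parameters in the real model)."""

    def __init__(self, layers, blocks):
        assert len(blocks) == layers and layers % 2 == 0
        self.blocks = blocks
        self.skip_gain = [1.0] * (layers // 2)  # learned in the PR

    def __call__(self, x):
        saved = []
        half = len(self.blocks) // 2
        for i, block in enumerate(self.blocks):
            if i < half:
                saved.append(x)                                # push encoder input
            else:
                x = x + self.skip_gain[i - half] * saved.pop()  # pop + gated add
            x = block(x)
        return x
```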
BigramHash
Adds a BigramHash embedding alongside tied token embeddings.
parameters: {"vocab":1536,"dimensions":128}
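A BigramHash embedding hashes each (previous, current) token pair into a small bucket table whose embeddings (dim 128 here) are combined with the tied token embedding. A sketch of the bucketing step only; the multiplier, mixing, and BOS handling are illustrative since the PR does not specify its hash:

```python
def bigram_hash_ids(tokens, n_buckets=1536):
    """Hash each (prev, cur) token pair into one of n_buckets ids.

    The 1000003 multiplier and xor mix are illustrative choices, not
    the PR's exact hash. Each bucket id indexes a small learned
    embedding table that is added alongside the token embedding.
    """
    ids = []
    prev = 0  # assumed BOS id
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % n_buckets)
        prev = t
    return ids
```

The table costs only 1536 × 128 parameters, which is what makes it attractive under an artifact-size budget.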
weight tying
Uses tied token embeddings.
parameters: null
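Weight tying means one matrix serves as both the input embedding table and the output logit projection. A minimal sketch with illustrative sizes:

```python
import numpy as np

# Hedged sketch of weight tying: the same table W is used to embed
# input tokens and to project hidden states back to vocab logits.
rng = np.random.default_rng(0)
V, D = 8, 4                        # toy vocab and hidden sizes
W = rng.standard_normal((V, D))    # the single shared table

def embed(token_id):
    return W[token_id]

def logits(hidden):
    return hidden @ W.T            # reuse the same weights for output
```

Tying halves the embedding parameter count, which matters most in small-artifact submissions like this one.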
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
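The MLP activation can be sketched as squaring the output of a LeakyReLU with slope 0.5. The PR does not spell out the exact form (e.g. whether the sign is preserved after squaring), so this is one plausible reading:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Hedged sketch of 'LeakyReLU squared': apply LeakyReLU with the
    stated slope of 0.5, then square elementwise. Sign handling is an
    assumption; the PR does not specify it."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```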
logit softcap
Applies polynomial softcap to logits.
parameters: {"degree":5,"cap":30}
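Logit softcapping usually means smoothly bounding logits to (-cap, cap) via cap·tanh(logits/cap). The PR's degree-5 polynomial is not given, so the sketch below shows the standard tanh form plus one plausible polynomial variant (the odd degree-5 Taylor approximation of tanh, which is cheap but only accurate for logits well below the cap):

```python
import numpy as np

def tanh_softcap(logits, cap=30.0):
    """Reference tanh softcap: smoothly bounds logits to (-cap, cap)."""
    return cap * np.tanh(logits / cap)

def poly_softcap(logits, cap=30.0):
    """Hedged guess at a degree-5 polynomial softcap: tanh replaced by
    its odd degree-5 Taylor expansion. The PR's exact polynomial is
    not specified; this is illustrative only."""
    u = logits / cap
    return cap * (u - u**3 / 3 + 2 * u**5 / 15)
```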
Regularization
z-loss
Penalizes the squared log-partition (logsumexp) of the logits to keep their scale in check.
parameters: {"weight":0.0001}
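The standard z-loss formulation, with the PR's weight of 1e-4, can be sketched as:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    """z-loss regularizer: penalizes the squared log-partition
    log(sum(exp(logits))) per position, discouraging logit-scale
    drift. Standard formulation; weight matches this PR."""
    m = logits.max(axis=-1, keepdims=True)  # stabilize logsumexp
    lse = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return weight * np.mean(lse ** 2)
```

The term is added to the cross-entropy loss during training.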
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz":true}
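Muon's Newton-Schulz step orthogonalizes each 2D gradient matrix before applying it. A hedged sketch of the commonly used quintic iteration (the coefficients below follow the widely circulated Muon reference implementation; the PR's exact variant is not shown):

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately map a gradient matrix G to the nearest
    (semi-)orthogonal matrix via a quintic Newton-Schulz iteration,
    as used by Muon for 2D weights. Coefficients are from the common
    Muon implementation; illustrative, not this PR's code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                          # work on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if tall:
        X = X.T
    return X
```

After a few steps the singular values of the output cluster near 1, which is the "orthogonalized update" Muon applies in place of an elementwise Adam step.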
Adam
weight_decay: null
momentum: null
other_params: {"scope":"scalars, embeddings, and GDN-specific delta-rule params"}
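The two-optimizer split above amounts to a routing rule over named parameters: 2D hidden weight matrices go to Muon, while scalars, embeddings, and the GDN delta-rule parameters go to Adam. A sketch with illustrative name patterns (the PR's actual parameter names are not shown):

```python
def split_param_groups(named_params):
    """Hedged sketch of the Muon/Adam routing rule. The substring
    checks ('embed', 'delta', 'gate') are illustrative stand-ins for
    the PR's actual naming scheme."""
    muon, adam = [], []
    for name, p in named_params:
        is_matrix = getattr(p, "ndim", 0) == 2
        is_special = "embed" in name or "delta" in name or "gate" in name
        (adam if (not is_matrix or is_special) else muon).append(name)
    return muon, adam
```

Keeping non-matrix and recurrence-specific parameters on Adam avoids forcing Muon's orthogonalized updates onto tensors they were not designed for.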
Compression
zlib
level: null
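The 15.79 MB artifact size is presumably measured on the zlib-compressed checkpoint bytes. A minimal sketch; the serialization of the weights into raw bytes is illustrative, not the PR's exact pipeline, and the compression level is left at zlib's default since the PR does not state one:

```python
import zlib

def compressed_size(raw: bytes, level: int = -1) -> int:
    """Size in bytes of the zlib-compressed payload. level=-1 is
    zlib's default; the PR does not specify a level."""
    return len(zlib.compress(raw, level))
```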
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Gated DeltaNet selective state space model using flash-linear-attention Triton kernels
  • Non-attention SSM baseline for the parameter golf challenge
  • Chunk-parallel delta-rule recurrence with chunk size 64
  • U-Net skip connections combined with GatedDeltaNet
  • BigramHash embedding and polynomial logit softcap in a compact 16MB submission
  • Routing delta-rule parameters to Adam while using Muon for 2D weights