PR #970

open

Non-record: GatedDeltaNet SSM via fla library — 1.2907 bpb, 15.79MB

val_bpb: 1.2907
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.79 MB

Training Techniques

Architecture
GatedDeltaNet
Replaces attention with a selective state space model using delta-rule recurrence and fused Triton kernels.
parameters: {"layers":12,"dimensions":384,"head_dim":64,"heads_per_layer":6,"chunk_size":64}
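The PR relies on fla's fused chunk-parallel Triton kernels, but the underlying recurrence can be sketched as a naive per-token reference. This is a minimal sketch of the textbook gated delta rule (decay the state, then write the new key→value association); shapes and gate names are illustrative, not the fla API:

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive per-token reference of the gated delta-rule recurrence.

    q, k: (T, d_k) queries/keys; v: (T, d_v) values;
    alpha: (T,) forget gates in (0, 1); beta: (T,) write strengths.
    The fla Triton kernels compute the same recurrence chunk-parallel
    (chunk size 64 in this PR) rather than token by token.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # state: a fast-weight k -> v map
    out = np.empty((T, d_v))
    for t in range(T):
        S = alpha[t] * S                                  # decay old state
        S = S + beta[t] * np.outer(k[t], v[t] - S.T @ k[t])  # delta-rule write
        out[t] = S.T @ q[t]                               # read with the query
    return out
```

With orthogonal keys and no decay, querying with an earlier key retrieves the value stored under it, which is the associative-memory behavior the delta rule is built for.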
U-Net skip connections
Adds learned U-Net-style skip connections from early layers to their mirrored late layers in the stack.
parameters: {"layers":12}
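The U-Net wiring can be sketched as a push/pop over the layer stack: the first half of the layers save their inputs, the second half add them back through a learned scalar. A minimal sketch; the gain parameters and block interface are illustrative, not the PR's actual module:

```python
class UNetStack:
    """Hedged sketch of learned U-Net skips over a 12-layer stack:
    encoder-half activations are saved and added back into the
    mirrored decoder-half layers via per-skip scalars (plain floats
    here; trainable parameters in the real model)."""

    def __init__(self, layers, blocks):
        assert len(blocks) == layers and layers % 2 == 0
        self.blocks = blocks
        self.skip_gain = [1.0] * (layers // 2)  # learned in the PR

    def __call__(self, x):
        saved = []
        half = len(self.blocks) // 2
        for i, block in enumerate(self.blocks):
            if i < half:
                saved.append(x)                                # push encoder input
            else:
                x = x + self.skip_gain[i - half] * saved.pop()  # pop + gated add
            x = block(x)
        return x
```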
BigramHash
Adds a BigramHash embedding alongside tied token embeddings.
parameters: {"vocab":1536,"dimensions":128}
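A BigramHash embedding hashes each (previous, current) token pair into a small bucket table whose embeddings (dim 128 here) are combined with the tied token embedding. A sketch of the bucketing step only; the multiplier, mixing, and BOS handling are illustrative since the PR does not specify its hash:

```python
def bigram_hash_ids(tokens, n_buckets=1536):
    """Hash each (prev, cur) token pair into one of n_buckets ids.

    The 1000003 multiplier and xor mix are illustrative choices, not
    the PR's exact hash. Each bucket id indexes a small learned
    embedding table that is added alongside the token embedding.
    """
    ids = []
    prev = 0  # assumed BOS id
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % n_buckets)
        prev = t
    return ids
```

The table costs only 1536 × 128 parameters, which is what makes it attractive under an artifact-size budget.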
weight tying
Uses tied token embeddings.
parameters: null
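Weight tying means one matrix serves as both the input embedding table and the output logit projection. A minimal sketch with illustrative sizes:

```python
import numpy as np

# Hedged sketch of weight tying: the same table W is used to embed
# input tokens and to project hidden states back to vocab logits.
rng = np.random.default_rng(0)
V, D = 8, 4                        # toy vocab and hidden sizes
W = rng.standard_normal((V, D))    # the single shared table

def embed(token_id):
    return W[token_id]

def logits(hidden):
    return hidden @ W.T            # reuse the same weights for output
```

Tying halves the embedding parameter count, which matters most in small-artifact submissions like this one.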
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
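The MLP activation can be sketched as squaring the output of a LeakyReLU with slope 0.5. The PR does not spell out the exact form (e.g. whether the sign is preserved after squaring), so this is one plausible reading:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Hedged sketch of 'LeakyReLU squared': apply LeakyReLU with the
    stated slope of 0.5, then square elementwise. Sign handling is an
    assumption; the PR does not specify it."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```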
logit softcap
Applies polynomial softcap to logits.
parameters: {"degree":5,"cap":30}
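Logit softcapping usually means smoothly bounding logits to (-cap, cap) via cap·tanh(logits/cap). The PR's degree-5 polynomial is not given, so the sketch below shows the standard tanh form plus one plausible polynomial variant (the odd degree-5 Taylor approximation of tanh, which is cheap but only accurate for logits well below the cap):

```python
import numpy as np

def tanh_softcap(logits, cap=30.0):
    """Reference tanh softcap: smoothly bounds logits to (-cap, cap)."""
    return cap * np.tanh(logits / cap)

def poly_softcap(logits, cap=30.0):
    """Hedged guess at a degree-5 polynomial softcap: tanh replaced by
    its odd degree-5 Taylor expansion. The PR's exact polynomial is
    not specified; this is illustrative only."""
    u = logits / cap
    return cap * (u - u**3 / 3 + 2 * u**5 / 15)
```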
Regularization
z-loss
Penalizes the squared log-partition (logsumexp) of the logits to keep their scale in check.
parameters: {"weight":0.0001}
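The standard z-loss formulation, with the PR's weight of 1e-4, can be sketched as:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    """z-loss regularizer: penalizes the squared log-partition
    log(sum(exp(logits))) per position, discouraging logit-scale
    drift. Standard formulation; weight matches this PR."""
    m = logits.max(axis=-1, keepdims=True)  # stabilize logsumexp
    lse = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return weight * np.mean(lse ** 2)
```

The term is added to the cross-entropy loss during training.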
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz":true}
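Muon's Newton-Schulz step orthogonalizes each 2D gradient matrix before applying it. A hedged sketch of the commonly used quintic iteration (the coefficients below follow the widely circulated Muon reference implementation; the PR's exact variant is not shown):

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately map a gradient matrix G to the nearest
    (semi-)orthogonal matrix via a quintic Newton-Schulz iteration,
    as used by Muon for 2D weights. Coefficients are from the common
    Muon implementation; illustrative, not this PR's code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T                          # work on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if tall:
        X = X.T
    return X
```

After a few steps the singular values of the output cluster near 1, which is the "orthogonalized update" Muon applies in place of an elementwise Adam step.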
Adam
weight_decay: null
momentum: null
other_params: {"scope":"scalars, embeddings, and GDN-specific delta-rule params"}
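The two-optimizer split above amounts to a routing rule over named parameters: 2D hidden weight matrices go to Muon, while scalars, embeddings, and the GDN delta-rule parameters go to Adam. A sketch with illustrative name patterns (the PR's actual parameter names are not shown):

```python
def split_param_groups(named_params):
    """Hedged sketch of the Muon/Adam routing rule. The substring
    checks ('embed', 'delta', 'gate') are illustrative stand-ins for
    the PR's actual naming scheme."""
    muon, adam = [], []
    for name, p in named_params:
        is_matrix = getattr(p, "ndim", 0) == 2
        is_special = "embed" in name or "delta" in name or "gate" in name
        (adam if (not is_matrix or is_special) else muon).append(name)
    return muon, adam
```

Keeping non-matrix and recurrence-specific parameters on Adam avoids forcing Muon's orthogonalized updates onto tensors they were not designed for.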
Compression
zlib
level: null
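The 15.79 MB artifact size is presumably measured on the zlib-compressed checkpoint bytes. A minimal sketch; the serialization of the weights into raw bytes is illustrative, not the PR's exact pipeline, and the compression level is left at zlib's default since the PR does not state one:

```python
import zlib

def compressed_size(raw: bytes, level: int = -1) -> int:
    """Size in bytes of the zlib-compressed payload. level=-1 is
    zlib's default; the PR does not specify a level."""
    return len(zlib.compress(raw, level))
```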
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Gated DeltaNet selective state space model using flash-linear-attention Triton kernels
  • Non-attention SSM baseline for the parameter golf challenge
  • Chunk-parallel delta-rule recurrence with chunk size 64
  • U-Net skip connections combined with GatedDeltaNet
  • BigramHash embedding and polynomial logit softcap in a compact 16MB submission
  • Routing delta-rule parameters to Adam while using Muon for 2D weights