PR #970
Status: open
Non-record: GatedDeltaNet SSM via fla library — 1.2907 bpb, 15.79MB
by dnldsz
val_bpb
1.2907
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.79 MB
Training Techniques
Architecture
GatedDeltaNet
Replaces attention with a selective state space model (Gated DeltaNet) that uses delta-rule recurrence and fused Triton kernels from the flash-linear-attention (fla) library.
parameters: {"layers":12,"dimensions":384,"head_dim":64,"heads_per_layer":6,"chunk_size":64}
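The core state update can be sketched as a naive per-token recurrence; the actual submission runs the fla library's chunk-parallel Triton kernels (chunk size 64) rather than this loop, and the exact gating parameterization may differ:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One naive (non-chunked) step of gated delta-rule recurrence.

    S:     (d_k, d_v) fast-weight state
    k, v:  key (d_k,) and value (d_v,); k assumed L2-normalized
    alpha: forget gate in (0, 1]; beta: write strength in (0, 1]
    """
    # S <- alpha * (I - beta * k k^T) S + beta * k v^T
    # i.e. decay the state, erase the stale value bound to k, write the new one.
    S = alpha * (S - beta * np.outer(k, k @ S)) + beta * np.outer(k, v)
    return S

def read_state(S, q):
    # Query the state: o = S^T q
    return q @ S
```

With alpha = beta = 1 the delta rule exactly overwrites the value previously associated with a key, which is what distinguishes it from a plain linear-attention accumulator.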
U-Net skip connections
Uses learned U-Net style skip connections in the model stack.
parameters: {"layers":12}
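One common wiring for learned U-Net skips in a 12-layer stack is for the first half of the layers to push their inputs onto a stack and the second half to pop them back in through learned scalar gates; the sketch below assumes that wiring, which may not match the submission exactly:

```python
def unet_stack_forward(x, layers, skip_weights):
    """U-Net style layer stack: the first half pushes activations,
    the second half pops them back in with learned scalar gates."""
    half = len(layers) // 2
    stack = []
    for i, layer in enumerate(layers):
        if i < half:
            stack.append(x)          # encoder half: save activation
        else:
            x = x + skip_weights[i - half] * stack.pop()  # decoder half: merge skip
        x = layer(x)
    return x
```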
BigramHash
Adds a BigramHash embedding alongside tied token embeddings.
parameters: {"vocab":1536,"dimensions":128}
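The idea is to hash each (previous, current) token pair into a small 1536-row, 128-dim table and add the result alongside the tied token embedding. The hash constants below are illustrative, not the submission's:

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab=1536):
    """Look up a hashed bigram embedding for each position.

    tokens: (T,) int array; table: (vocab, 128) embedding table.
    """
    prev = np.concatenate([[0], tokens[:-1]])  # pad the first position
    idx = (prev * 1000003 + tokens) % vocab    # cheap multiplicative hash (illustrative)
    return table[idx]                          # (T, 128)
```

Collisions are accepted by design: the table is far smaller than the squared vocabulary, which is what keeps the artifact small.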
weight tying
Uses tied token embeddings.
parameters: null
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
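A literal reading of "LeakyReLU squared" with slope 0.5 is to square the LeakyReLU output, shown below; note this makes the activation nonnegative, and the submission may instead use a sign-preserving variant:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # Square of LeakyReLU: x^2 for x > 0, (slope * x)^2 otherwise.
    y = np.where(x > 0, x, slope * x)
    return y * y
```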
logit softcap
Applies polynomial softcap to logits.
parameters: {"degree":5,"cap":30}
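One polynomial-style softcap consistent with degree 5 and cap 30 is shown below: near-identity for small logits, saturating smoothly at ±cap. The submission's exact polynomial is not specified here, so treat this form as an assumption:

```python
import numpy as np

def poly_softcap(x, cap=30.0, degree=5):
    # Near-identity for |x| << cap; asymptotes to +/- cap as |x| grows.
    return x / (1.0 + np.abs(x / cap) ** degree) ** (1.0 / degree)
```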
Regularization
z-loss
parameters: {"weight":0.0001}
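z-loss penalizes the squared log-partition function of the logits, keeping them near a normalized scale; a minimal sketch with the listed weight of 1e-4:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    """z-loss regularizer: weight * mean(logsumexp(logits)^2)."""
    m = logits.max(axis=-1)
    z = m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))  # stable logsumexp
    return weight * np.mean(z ** 2)
```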
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz":true}
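Muon's Newton-Schulz step approximately orthogonalizes each 2D gradient before the momentum update; a sketch using the quintic iteration coefficients from the public Muon implementation:

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix via quintic Newton-Schulz,
    driving its singular values toward 1 (as in Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```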
Adam
weight_decay: null
momentum: null
other_params: {"scope":"scalars, embeddings, and GDN-specific delta-rule params"}
Compression
zlib
level: null
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- Gated DeltaNet selective state space model using flash-linear-attention Triton kernels
- Non-attention SSM baseline for the parameter golf challenge
- Chunk-parallel delta-rule recurrence with chunk size 64
- U-Net skip connections combined with GatedDeltaNet
- BigramHash embedding and polynomial logit softcap in a compact 16MB submission
- Routing delta-rule parameters to Adam while using Muon for 2D weights
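The last bullet's optimizer routing can be sketched as a simple parameter split: 2D hidden-weight matrices go to Muon, while scalars, embeddings, and GDN-specific delta-rule parameters go to Adam. The name patterns below ('embed', 'a_log', 'dt_bias') are illustrative, not taken from the submission:

```python
import numpy as np

def split_param_groups(named_params):
    """Route parameters to optimizers: Muon gets 2D hidden weights,
    Adam gets everything else (scalars, embeddings, delta-rule params)."""
    muon, adam = [], []
    for name, p in named_params:
        is_matrix = p.ndim == 2
        is_special = any(s in name for s in ('embed', 'a_log', 'dt_bias'))
        (muon if is_matrix and not is_special else adam).append(name)
    return muon, adam
```

Keeping non-matrix and embedding-like parameters out of Muon matters because its Newton-Schulz orthogonalization is only meaningful for 2D weight matrices.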