PR #969

closed

Non-record: GatedDeltaNet SSM via fla library — 1.2907 bpb, 15.79MB

val_bpb
1.2907
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.79MB

Training Techniques

Architecture
GatedDeltaNet
Selective state space model replacing attention with delta-rule recurrence and chunk-parallel Triton kernels.
parameters: {"layers":12,"dimensions":384,"chunk_size":64}
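The PR relies on fla's chunk-parallel Triton kernels; the recurrence itself can be sketched sequentially in plain numpy. This is an illustrative gated delta-rule step, not fla's implementation: the state decays by a gate, erases the old association along the key, and writes the new key-value pair.

```python
import numpy as np

def gated_delta_net_step(S, q, k, v, beta, alpha):
    """One gated delta-rule recurrence step (illustrative, not fla's kernel).

    S     : (d_k, d_v) running state matrix
    q, k  : (d_k,) query / key vectors (k assumed L2-normalized)
    v     : (d_v,) value vector
    beta  : scalar write strength in [0, 1]
    alpha : scalar decay gate in [0, 1]
    """
    S = alpha * S                          # gated decay of old state
    S = S - beta * np.outer(k, k @ S)      # delta-rule erase along k
    S = S + beta * np.outer(k, v)          # delta-rule write of (k, v)
    o = q @ S                              # read-out for the query
    return S, o

# Toy usage: run a short sequence through the recurrence.
rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 5
S = np.zeros((d_k, d_v))
for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)
    S, o = gated_delta_net_step(S, rng.normal(size=d_k), k,
                                rng.normal(size=d_v), beta=0.9, alpha=0.95)
```

With beta=1 and alpha=1, writing a pair (k, v) into an empty state and reading back with q=k recovers v exactly, which is the associative-memory view of the delta rule.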
U-Net skip connections
Encoder/decoder-style skip connections with learned skip weights.
parameters: {"layers":12}
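A minimal sketch of the skip scheme over the 12-layer stack, assuming the usual encoder/decoder pairing: the first half pushes activations onto a stack and the second half pops them back, scaled by a learned per-layer scalar. The `layer_fn` interface and the 1.0 initialization are illustrative, not the PR's code.

```python
import numpy as np

n_layers, dim = 12, 8
skip_weights = np.ones(n_layers // 2)  # learned scalars, one per skip pair

def forward(x, layer_fn):
    """U-Net-style transformer pass: encoder half stores activations,
    decoder half fuses them back in LIFO order with learned weights."""
    stack = []
    for i in range(n_layers // 2):            # encoder half
        x = layer_fn(x, i)
        stack.append(x)
    for i in range(n_layers // 2):            # decoder half
        x = x + skip_weights[i] * stack.pop() # learned-weight skip fusion
        x = layer_fn(x, n_layers // 2 + i)
    return x
```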
BigramHash
Bigram hash embedding used alongside tied token embeddings.
parameters: {"vocab":1536,"dimensions":128}
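One plausible form of the bigram hash embedding, with the card's table size (vocab=1536, dim=128): each (prev_token, token) pair is hashed into a small bucket table whose embedding is combined with the tied token embedding. The mixing constant and BOS handling are assumptions.

```python
import numpy as np

BIGRAM_VOCAB, BIGRAM_DIM = 1536, 128
bigram_table = np.zeros((BIGRAM_VOCAB, BIGRAM_DIM))  # learned in training

def bigram_ids(tokens, bos=0):
    """Map each (prev, cur) token pair to a bucket in the bigram table."""
    ids, prev = [], bos
    for t in tokens:
        h = (prev * 1000003 + t) % BIGRAM_VOCAB  # simple multiplicative hash
        ids.append(h)
        prev = t
    return ids
```

Identical bigrams always collide into the same bucket, so frequent pairs get a dedicated embedding at a tiny parameter cost.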
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
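A sketch of the squared-LeakyReLU activation with the card's slope of 0.5. Assumption: the square is applied after the leak, so a negative input x contributes (0.5*x)^2; the submission may handle the negative branch differently.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU MLP activation (form assumed: leak, then square)."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```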
weight tying
Tied token embeddings.
parameters: null
Regularization
logit softcap
parameters: {"degree":5,"cap":30}
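The common tanh form of a logit softcap with the card's cap of 30 looks like the sketch below; what the `degree: 5` parameter selects is not specified here, so only the baseline tanh variant is shown.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits smoothly into (-cap, cap) instead of hard clipping."""
    return cap * np.tanh(logits / cap)
```

Near zero the map is close to identity (tanh(x) ≈ x), so small logits pass through almost unchanged while outliers saturate.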
z-loss
parameters: {"weight":0.0001}
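The z-loss penalizes the squared log-partition function (logsumexp of the logits), keeping logit magnitudes from drifting; a minimal numpy version with the card's 1e-4 weight:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    """Auxiliary z-loss: weight * mean( logsumexp(logits)^2 ) per position."""
    m = logits.max(axis=-1)
    z = np.log(np.sum(np.exp(logits - m[..., None]), axis=-1)) + m
    return weight * np.mean(z ** 2)
```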
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz":true}
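Muon's defining step is a Newton-Schulz iteration that approximately orthogonalizes each gradient matrix before the update. A numpy sketch using the quintic coefficients from the public Muon implementation (the real optimizer runs this in bfloat16 on GPU):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After a few steps the singular values are pushed toward 1, so the update direction is roughly orthogonal rather than dominated by a few large modes.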
Adam
weight_decay: null
momentum: null
other_params: {"used_for":["scalars","embeddings","GDN delta-rule params"]}
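The routing the card describes (matrices to Muon; scalars, embeddings, and GDN delta-rule parameters to Adam) can be sketched as a name- and shape-based rule. The keyword list below is hypothetical, not the PR's actual parameter names:

```python
def route_param(name, ndim):
    """Assign a parameter to Muon or Adam (illustrative routing rule)."""
    adam_keywords = ("embed", "scalar", "beta", "gate", "a_log")  # assumed names
    if ndim < 2 or any(k in name for k in adam_keywords):
        return "adam"   # scalars/vectors and recurrence-sensitive params
    return "muon"       # 2D weight matrices get orthogonalized updates
```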
Compression
zlib
level: null
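A sketch of the int8 + zlib artifact pipeline the card and the contributions list describe: symmetric per-tensor quantization of float32 weights to int8, followed by deflate on the raw bytes. The exact quantization scheme used by the submission is an assumption.

```python
import zlib
import numpy as np

def pack_tensor(w):
    """Quantize a float32 tensor to int8 (symmetric, per-tensor) and deflate."""
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def unpack_tensor(blob, scale, shape):
    """Inverse of pack_tensor: inflate, reinterpret as int8, dequantize."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (scale / 2) per weight, which is what makes the 16MB artifact budget reachable without retraining.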
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Gated DeltaNet selective state space model using fla Triton kernels
  • Non-record unlimited compute baseline under the 16MB artifact limit
  • Explicit routing of delta-rule parameters to Adam to preserve recurrence dynamics
  • Demonstration of a pure SSM alternative to attention in parameter golf
  • Int8+zlib artifact compression achieving 15.79MB