PR #969

closed

Non-record: GatedDeltaNet SSM via fla library — 1.2907 bpb, 15.79MB

val_bpb
1.2907
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.79MB

Training Techniques

Architecture
GatedDeltaNet
Selective state space model replacing attention with delta-rule recurrence and chunk-parallel Triton kernels.
parameters: {"layers":12,"dimensions":384,"chunk_size":64}
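The PR relies on fla's chunk-parallel Triton kernels; the recurrence itself can be sketched sequentially in plain numpy. This is an illustrative gated delta-rule step, not fla's implementation: the state decays by a gate, erases the old association along the key, and writes the new key-value pair.

```python
import numpy as np

def gated_delta_net_step(S, q, k, v, beta, alpha):
    """One gated delta-rule recurrence step (illustrative, not fla's kernel).

    S     : (d_k, d_v) running state matrix
    q, k  : (d_k,) query / key vectors (k assumed L2-normalized)
    v     : (d_v,) value vector
    beta  : scalar write strength in [0, 1]
    alpha : scalar decay gate in [0, 1]
    """
    S = alpha * S                          # gated decay of old state
    S = S - beta * np.outer(k, k @ S)      # delta-rule erase along k
    S = S + beta * np.outer(k, v)          # delta-rule write of (k, v)
    o = q @ S                              # read-out for the query
    return S, o

# Toy usage: run a short sequence through the recurrence.
rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 5
S = np.zeros((d_k, d_v))
for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)
    S, o = gated_delta_net_step(S, rng.normal(size=d_k), k,
                                rng.normal(size=d_v), beta=0.9, alpha=0.95)
```

With beta=1 and alpha=1, writing a pair (k, v) into an empty state and reading back with q=k recovers v exactly, which is the associative-memory view of the delta rule.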
U-Net skip connections
Encoder/decoder-style skip connections with learned skip weights.
parameters: {"layers":12}
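A minimal sketch of the skip scheme over the 12-layer stack, assuming the usual encoder/decoder pairing: the first half pushes activations onto a stack and the second half pops them back, scaled by a learned per-layer scalar. The `layer_fn` interface and the 1.0 initialization are illustrative, not the PR's code.

```python
import numpy as np

n_layers, dim = 12, 8
skip_weights = np.ones(n_layers // 2)  # learned scalars, one per skip pair

def forward(x, layer_fn):
    """U-Net-style transformer pass: encoder half stores activations,
    decoder half fuses them back in LIFO order with learned weights."""
    stack = []
    for i in range(n_layers // 2):            # encoder half
        x = layer_fn(x, i)
        stack.append(x)
    for i in range(n_layers // 2):            # decoder half
        x = x + skip_weights[i] * stack.pop() # learned-weight skip fusion
        x = layer_fn(x, n_layers // 2 + i)
    return x
```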
BigramHash
Bigram hash embedding used alongside tied token embeddings.
parameters: {"vocab":1536,"dimensions":128}
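One plausible form of the bigram hash embedding, with the card's table size (vocab=1536, dim=128): each (prev_token, token) pair is hashed into a small bucket table whose embedding is combined with the tied token embedding. The mixing constant and BOS handling are assumptions.

```python
import numpy as np

BIGRAM_VOCAB, BIGRAM_DIM = 1536, 128
bigram_table = np.zeros((BIGRAM_VOCAB, BIGRAM_DIM))  # learned in training

def bigram_ids(tokens, bos=0):
    """Map each (prev, cur) token pair to a bucket in the bigram table."""
    ids, prev = [], bos
    for t in tokens:
        h = (prev * 1000003 + t) % BIGRAM_VOCAB  # simple multiplicative hash
        ids.append(h)
        prev = t
    return ids
```

Identical bigrams always collide into the same bucket, so frequent pairs get a dedicated embedding at a tiny parameter cost.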
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
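A sketch of the squared-LeakyReLU activation with the card's slope of 0.5. Assumption: the square is applied after the leak, so a negative input x contributes (0.5*x)^2; the submission may handle the negative branch differently.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU MLP activation (form assumed: leak, then square)."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```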
weight tying
Tied token embeddings.
parameters: null
Regularization
logit softcap
parameters: {"degree":5,"cap":30}
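The common tanh form of a logit softcap with the card's cap of 30 looks like the sketch below; what the `degree: 5` parameter selects is not specified here, so only the baseline tanh variant is shown.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits smoothly into (-cap, cap) instead of hard clipping."""
    return cap * np.tanh(logits / cap)
```

Near zero the map is close to identity (tanh(x) ≈ x), so small logits pass through almost unchanged while outliers saturate.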
z-loss
parameters: {"weight":0.0001}
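The z-loss penalizes the squared log-partition function (logsumexp of the logits), keeping logit magnitudes from drifting; a minimal numpy version with the card's 1e-4 weight:

```python
import numpy as np

def z_loss(logits, weight=1e-4):
    """Auxiliary z-loss: weight * mean( logsumexp(logits)^2 ) per position."""
    m = logits.max(axis=-1)
    z = np.log(np.sum(np.exp(logits - m[..., None]), axis=-1)) + m
    return weight * np.mean(z ** 2)
```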
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"newton_schulz":true}
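Muon's defining step is a Newton-Schulz iteration that approximately orthogonalizes each gradient matrix before the update. A numpy sketch using the quintic coefficients from the public Muon implementation (the real optimizer runs this in bfloat16 on GPU):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After a few steps the singular values are pushed toward 1, so the update direction is roughly orthogonal rather than dominated by a few large modes.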
Adam
weight_decay: null
momentum: null
other_params: {"used_for":["scalars","embeddings","GDN delta-rule params"]}
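The routing the card describes (matrices to Muon; scalars, embeddings, and GDN delta-rule parameters to Adam) can be sketched as a name- and shape-based rule. The keyword list below is hypothetical, not the PR's actual parameter names:

```python
def route_param(name, ndim):
    """Assign a parameter to Muon or Adam (illustrative routing rule)."""
    adam_keywords = ("embed", "scalar", "beta", "gate", "a_log")  # assumed names
    if ndim < 2 or any(k in name for k in adam_keywords):
        return "adam"   # scalars/vectors and recurrence-sensitive params
    return "muon"       # 2D weight matrices get orthogonalized updates
```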
Compression
zlib
level: null
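A sketch of the int8 + zlib artifact pipeline the card and the contributions list describe: symmetric per-tensor quantization of float32 weights to int8, followed by deflate on the raw bytes. The exact quantization scheme used by the submission is an assumption.

```python
import zlib
import numpy as np

def pack_tensor(w):
    """Quantize a float32 tensor to int8 (symmetric, per-tensor) and deflate."""
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def unpack_tensor(blob, scale, shape):
    """Inverse of pack_tensor: inflate, reinterpret as int8, dequantize."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (scale / 2) per weight, which is what makes the 16MB artifact budget reachable without retraining.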
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Gated DeltaNet selective state space model using fla Triton kernels
  • Non-record unlimited compute baseline under the 16MB artifact limit
  • Explicit routing of delta-rule parameters to Adam to preserve recurrence dynamics
  • Demonstration of a pure SSM alternative to attention in parameter golf
  • Int8+zlib artifact compression achieving 15.79MB