PR #1479

open

Non-record: GDN Hybrid (E2E TTT / State-Space Model) — val_bpb 1.14502

by andrewbaggio1
val_bpb: 1.1450
Architecture: Hybrid
Artifact Size: 13.83 MB

Training Techniques

Architecture
Gated DeltaNet
Replaces 8 of 10 attention layers with Gated DeltaNet linear/state-space layers in a hybrid model.
parameters: {"layers":8}
GQA
Uses grouped query attention in the remaining attention layers.
parameters: {"q_heads":8,"kv_heads":4}
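The card lists GQA with 8 query heads sharing 4 KV heads, so each KV head serves two query heads. A minimal NumPy sketch of that sharing pattern (function name and tensor shapes are illustrative, not from the PR):

```python
import numpy as np

def gqa(x, wq, wk, wv, q_heads=8, kv_heads=4):
    """Grouped query attention sketch: kv_heads < q_heads, and each
    KV head is shared by q_heads // kv_heads query heads."""
    T, d = x.shape
    hd = d // q_heads                                  # per-head dim
    q = (x @ wq).reshape(T, q_heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)              # fewer KV heads
    v = (x @ wv).reshape(T, kv_heads, hd)
    group = q_heads // kv_heads                        # 2 q-heads per KV head
    out = np.empty_like(q)
    for h in range(q_heads):
        kh, vh = k[:, h // group], v[:, h // group]    # shared KV head
        scores = q[:, h] @ kh.T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), 1) # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ vh
    return out.reshape(T, d)
```

With 4 KV heads the K/V projections are half the size of the Q projection, which is the main artifact-size win of GQA here.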
RoPE
Applies partial rotary positional embeddings in attention layers.
parameters: {"dimensions":16}
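Partial RoPE rotates only the first 16 dimensions of each head and passes the rest through unchanged. A hedged NumPy sketch of that (the function name and the base frequency 10000 are conventional assumptions, not stated in the PR):

```python
import numpy as np

def partial_rope(q, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of each head by position-dependent
    angles; remaining dims are left unrotated. q has shape (T, heads, hd)."""
    T, H, hd = q.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # per-pair frequency
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = q[..., :half], q[..., half:rot_dims]       # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[..., rot_dims:]], axis=-1)
```

Since the transform is a pure rotation, position 0 is unchanged and vector norms are preserved.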
MLP3x
Uses a 3x MLP expansion with LeakyReLU activation.
parameters: {"activation":"LeakyReLU","multiplier":3}
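The MLP entry is a plain d → 3d → d block with LeakyReLU, in place of the more common 4x expansion. A minimal sketch (names and the 0.01 slope are illustrative defaults, not from the PR):

```python
import numpy as np

def mlp3x(x, w_up, w_down, slope=0.01):
    """3x-expansion MLP sketch: project d -> 3d, LeakyReLU, project back.
    w_up has shape (d, 3d), w_down has shape (3d, d)."""
    h = x @ w_up
    h = np.where(h > 0, h, slope * h)   # LeakyReLU
    return h @ w_down
```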
weight tying
Ties input and output embeddings.
parameters: null
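Weight tying means the unembedding reuses the input embedding matrix, so no separate output projection is stored in the artifact. A one-function sketch of the effect (illustrative, not the PR's code):

```python
import numpy as np

def tied_logits(hidden, embed):
    """Weight tying sketch: logits are computed against the transposed
    input embedding matrix. hidden: (T, d), embed: (vocab, d)."""
    return hidden @ embed.T   # (T, vocab)
```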
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
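The matrices are stored at 6 bits and the embeddings at 8 bits via GPTQ. GPTQ proper quantizes column-by-column while compensating rounding error with second-order (Hessian) information; the sketch below shows only the symmetric per-channel uniform grid such a scheme quantizes onto, not the Hessian compensation (function name and details are illustrative):

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Per-channel symmetric uniform quantization sketch. Each row gets a
    scale so its max-magnitude weight maps to the largest positive code;
    codes occupy signed `bits`-bit integers."""
    levels = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / levels
    codes = np.clip(np.round(w / scale), -levels - 1, levels)
    return codes * scale, codes, scale                 # dequantized, ints, scales
```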
Weight Averaging
EMA
parameters: {"decay":0.997}
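With decay 0.997, the averaged weights track a slow exponential moving average of the training weights. The update is one line per parameter (sketch, dict-of-arrays layout assumed):

```python
def ema_update(avg, params, decay=0.997):
    """EMA weight averaging sketch: avg <- decay * avg + (1 - decay) * params,
    applied elementwise to every parameter tensor."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}
```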
Compression
brotli
level: 11
Other
other
Uses Gated DeltaNet as an end-to-end test-time-training (TTT) / state-space model: its delta-rule state update is equivalent to one SGD step on a reconstruction loss.
parameters: null
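The TTT reading above can be made concrete: the delta-rule part of the Gated DeltaNet recurrence, S ← S − β(Sk − v)kᵀ, is exactly one SGD step with learning rate β on the reconstruction loss ½‖Sk − v‖², since the gradient of that loss w.r.t. S is (Sk − v)kᵀ. A one-step NumPy sketch (function name illustrative; the gated form follows the Gated DeltaNet recurrence S ← α·S(I − βkkᵀ) + βvkᵀ):

```python
import numpy as np

def gated_delta_step(S, k, v, beta, alpha):
    """One Gated DeltaNet recurrence step (sketch).
    With alpha = 1 this reduces to S - beta * (S @ k - v) k^T, i.e. one
    SGD step with lr beta on 0.5 * ||S @ k - v||^2 -- the E2E TTT view."""
    return alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
```

Writing a unit-norm key with β = 1 into an empty state makes S @ k retrieve v exactly, which is the associative-memory behavior the delta rule implements.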

Novel Contributions

  • Hybrid model replacing most attention layers with Gated DeltaNet
  • Interprets Gated DeltaNet as equivalent to E2E TTT-Linear with MSE loss
  • Targets both E2E TTT and state-space model bounty items
  • Demonstrates stable training and working GPTQ quantization
  • Uses a compact 13.83 MB artifact