PR #1479

open

Non-record: GDN Hybrid (E2E TTT / State-Space Model) — val_bpb 1.14502

by andrewbaggio1
val_bpb: 1.1450
Architecture: Hybrid
Artifact Size: 13.83 MB

Training Techniques

Architecture
Gated DeltaNet
Replaces 8 of 10 attention layers with Gated DeltaNet linear/state-space layers in a hybrid model.
parameters: {"layers":8}
GQA
Uses grouped query attention in the remaining attention layers.
parameters: {"q_heads":8,"kv_heads":4}
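The card lists GQA with 8 query heads sharing 4 KV heads, so each KV head serves two query heads. A minimal NumPy sketch of that sharing pattern (function name and tensor shapes are illustrative, not from the PR):

```python
import numpy as np

def gqa(x, wq, wk, wv, q_heads=8, kv_heads=4):
    """Grouped query attention sketch: kv_heads < q_heads, and each
    KV head is shared by q_heads // kv_heads query heads."""
    T, d = x.shape
    hd = d // q_heads                                  # per-head dim
    q = (x @ wq).reshape(T, q_heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)              # fewer KV heads
    v = (x @ wv).reshape(T, kv_heads, hd)
    group = q_heads // kv_heads                        # 2 q-heads per KV head
    out = np.empty_like(q)
    for h in range(q_heads):
        kh, vh = k[:, h // group], v[:, h // group]    # shared KV head
        scores = q[:, h] @ kh.T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), 1) # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ vh
    return out.reshape(T, d)
```

With 4 KV heads the K/V projections are half the size of the Q projection, which is the main artifact-size win of GQA here.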
RoPE
Applies partial rotary positional embeddings in attention layers.
parameters: {"dimensions":16}
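Partial RoPE rotates only the first 16 dimensions of each head and passes the rest through unchanged. A hedged NumPy sketch of that (the function name and the base frequency 10000 are conventional assumptions, not stated in the PR):

```python
import numpy as np

def partial_rope(q, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` dims of each head by position-dependent
    angles; remaining dims are left unrotated. q has shape (T, heads, hd)."""
    T, H, hd = q.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # per-pair frequency
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = q[..., :half], q[..., half:rot_dims]       # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[..., rot_dims:]], axis=-1)
```

Since the transform is a pure rotation, position 0 is unchanged and vector norms are preserved.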
MLP3x
Uses a 3x MLP expansion with LeakyReLU activation.
parameters: {"activation":"LeakyReLU","multiplier":3}
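The MLP entry is a plain d → 3d → d block with LeakyReLU, in place of the more common 4x expansion. A minimal sketch (names and the 0.01 slope are illustrative defaults, not from the PR):

```python
import numpy as np

def mlp3x(x, w_up, w_down, slope=0.01):
    """3x-expansion MLP sketch: project d -> 3d, LeakyReLU, project back.
    w_up has shape (d, 3d), w_down has shape (3d, d)."""
    h = x @ w_up
    h = np.where(h > 0, h, slope * h)   # LeakyReLU
    return h @ w_down
```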
weight tying
Ties input and output embeddings.
parameters: null
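Weight tying means the unembedding reuses the input embedding matrix, so no separate output projection is stored in the artifact. A one-function sketch of the effect (illustrative, not the PR's code):

```python
import numpy as np

def tied_logits(hidden, embed):
    """Weight tying sketch: logits are computed against the transposed
    input embedding matrix. hidden: (T, d), embed: (vocab, d)."""
    return hidden @ embed.T   # (T, vocab)
```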
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: embeddings
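The matrices are stored at 6 bits and the embeddings at 8 bits via GPTQ. GPTQ proper quantizes column-by-column while compensating rounding error with second-order (Hessian) information; the sketch below shows only the symmetric per-channel uniform grid such a scheme quantizes onto, not the Hessian compensation (function name and details are illustrative):

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Per-channel symmetric uniform quantization sketch. Each row gets a
    scale so its max-magnitude weight maps to the largest positive code;
    codes occupy signed `bits`-bit integers."""
    levels = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / levels
    codes = np.clip(np.round(w / scale), -levels - 1, levels)
    return codes * scale, codes, scale                 # dequantized, ints, scales
```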
Weight Averaging
EMA
parameters: {"decay":0.997}
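With decay 0.997, the averaged weights track a slow exponential moving average of the training weights. The update is one line per parameter (sketch, dict-of-arrays layout assumed):

```python
def ema_update(avg, params, decay=0.997):
    """EMA weight averaging sketch: avg <- decay * avg + (1 - decay) * params,
    applied elementwise to every parameter tensor."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}
```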
Compression
brotli
level: 11
Other
other
Uses Gated DeltaNet as an end-to-end test-time-training (TTT) / state-space model: its delta-rule state update is equivalent to one SGD step on a reconstruction loss.
parameters: null
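The TTT reading above can be made concrete: the delta-rule part of the Gated DeltaNet recurrence, S ← S − β(Sk − v)kᵀ, is exactly one SGD step with learning rate β on the reconstruction loss ½‖Sk − v‖², since the gradient of that loss w.r.t. S is (Sk − v)kᵀ. A one-step NumPy sketch (function name illustrative; the gated form follows the Gated DeltaNet recurrence S ← α·S(I − βkkᵀ) + βvkᵀ):

```python
import numpy as np

def gated_delta_step(S, k, v, beta, alpha):
    """One Gated DeltaNet recurrence step (sketch).
    With alpha = 1 this reduces to S - beta * (S @ k - v) k^T, i.e. one
    SGD step with lr beta on 0.5 * ||S @ k - v||^2 -- the E2E TTT view."""
    return alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
```

Writing a unit-norm key with β = 1 into an empty state makes S @ k retrieve v exactly, which is the associative-memory behavior the delta rule implements.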

Novel Contributions

  • Hybrid model replacing most attention layers with Gated DeltaNet
  • Interprets Gated DeltaNet as equivalent to E2E TTT-Linear with MSE loss
  • Targets both E2E TTT and state-space model bounty items
  • Demonstrates stable training and working GPTQ quantization
  • Uses a compact 13.83 MB artifact