PR #1544 (closed)
Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.028308 (3-seed cold-cache mean)
by Abhishek8108
val_bpb: 1.0283
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.48–14.70 MB
Training Techniques
Architecture
GatedDeltaNet
Replaces the transformer backbone with a delta-rule linear recurrence model for long-range memory.
parameters: {"layers":10,"head_dim":64,"use_short_conv":true}
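A minimal sketch of the delta-rule state update at the core of this kind of layer (single head, pure Python; the exact gating parameterization `alpha`/`beta` and this simplified form are illustrative assumptions, not the record's implementation):

```python
def delta_step(S, q, k, v, alpha, beta):
    """One gated delta-rule recurrence step (illustrative simplification).

    S is a d x d state matrix (list of lists). The update
        S <- alpha * S @ (I - beta * k k^T) + beta * v k^T
    erases the old association stored under key k before writing v,
    then the output is a read: o = S @ q.
    """
    d = len(k)
    # S @ (I - beta k k^T) expands to S - beta * (S k) k^T
    Sk = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    S_new = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
              for j in range(d)] for i in range(d)]
    o = [sum(S_new[i][j] * q[j] for j in range(d)) for i in range(d)]
    return S_new, o
```

With a unit-norm key and `beta = 1`, reading back with `q = k` retrieves the stored value exactly, which is the associative-memory behavior the delta rule is used for.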
SWA
Adds sliding window attention layers for local context.
parameters: {"window":512,"heads":8,"kv_heads":4,"shared":true}
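The attention pattern this adds is a causal band: position `i` attends only to positions `j` with `i - window < j <= i`. A minimal mask sketch (shape and semantics assumed; head grouping and weight sharing are not modeled here):

```python
def swa_mask(seq_len, window):
    """Causal sliding-window mask: mask[i][j] is True where query i
    may attend to key j, i.e. j lies in (i - window, i]."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]
```

With `window: 512`, each token sees at most its 511 predecessors plus itself; the `kv_heads: 4` setting with `heads: 8` suggests grouped-query attention, though the record does not spell that out.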
BigramHash
Uses hashed bigram embeddings as an auxiliary token representation.
parameters: {"dimensions":[3072,112]}
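Hashed n-gram embeddings look up a row by hashing the token pair rather than storing a full vocab-squared table. A hypothetical bucket function (the mixing constants and bucket count here are placeholders, not the record's actual hash):

```python
def bigram_bucket(prev_tok, tok, n_buckets=3072):
    """Map a (previous, current) token pair to an embedding-table row.
    Uses illustrative 32-bit avalanche mixing; collisions are expected
    and absorbed by training."""
    h = (prev_tok * 0x9E3779B1 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    return h % n_buckets
```

The resulting embedding would then be added to (or concatenated with) the ordinary token embedding as an auxiliary representation.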
TrigramHash
Uses trigram hash embeddings as an auxiliary token representation.
parameters: null
SmearGate
Applies a smear gate on token embeddings.
parameters: null
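One common reading of a "smear" gate is a learned, token-dependent blend of each embedding with the previous token's embedding; the record gives no parameters, so the form below is purely an assumption:

```python
import math

def smear_gate(x_prev, x, w, b=0.0):
    """Hypothetical smear gate: add a sigmoid-gated fraction of the
    previous token's embedding x_prev into the current embedding x.
    (w, b) stand in for a learned gate projection."""
    g = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return [xi + g * pi for xi, pi in zip(x, x_prev)]
```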
weight tying
Shares weights across both SWA layers.
parameters: null
Quantization
GPTQ
bits: 6
scope: all
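At 6 bits, symmetric quantization gives integer levels in [-31, 31]. The round-trip below shows only the quantization grid; real GPTQ chooses the roundings using second-order (Hessian) information rather than naive nearest-rounding:

```python
def quantize_rows_int6(W):
    """Per-row symmetric 6-bit fake quantization (grid illustration only;
    GPTQ's error-compensating rounding is not reproduced here)."""
    qmax = 2 ** (6 - 1) - 1  # 31 levels on each side of zero
    out = []
    for row in W:
        scale = max(abs(x) for x in row) / qmax or 1.0  # avoid zero scale
        out.append([round(x / scale) * scale for x in row])
    return out
```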
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
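The EMA update itself is a one-liner per parameter; with `decay: 0.997`, the averaged weights track roughly the last few hundred steps of training:

```python
def ema_update(avg, new, decay=0.997):
    """Exponential moving average of parameters:
    avg <- decay * avg + (1 - decay) * new."""
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]
```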
Evaluation
sliding window eval
parameters: {"window":512,"stride":64}
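One plausible reading of these parameters (the record does not specify the exact scheme): windows of 512 tokens advance by a stride of 64, and each window scores only the tokens not yet scored, so every token is predicted once with long left context:

```python
def eval_windows(n_tokens, window=512, stride=64):
    """Enumerate (ctx_start, ctx_end, score_start, score_end) spans for
    strided sliding-window evaluation. Each token is scored exactly once,
    in the earliest window that reaches it."""
    spans, start, scored_upto = [], 0, 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto, end))
        scored_upto = end
        start += stride
    return spans
```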
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
cosine decay
parameters: {"warmup_steps":100}
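A standard linear-warmup-then-cosine schedule matching the `warmup_steps: 100` parameter (base and minimum learning rates are placeholders; the record does not state them):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=100, min_lr=0.0):
    """Linear warmup for warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```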
Regularization
logit softcap
parameters: {"value":30}
Other
other
Cold-cache 3-seed training on fresh pods with Triton JIT overhead explicitly accounted for.
parameters: {"seeds":3}
Novel Contributions
- First non-transformer architecture in the 10-minute record track
- Hybrid Gated DeltaNet + Sliding Window Attention backbone
- 3-seed cold-cache mean record result
- Weight-shared SWA layers combined with recurrent delta-rule memory
- Full-Hessian int6 GPTQ quantization with zstd compression