PR #1545
Record (closed): GDN-Hybrid (Gated DeltaNet + SWA), val_bpb 1.028308 (3-seed cold-cache mean)
by Abhishek8108
val_bpb: 1.0283
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.48–14.70 MB
Training Techniques
Architecture
GatedDeltaNet
Replaces the transformer backbone with Gated DeltaNet recurrent layers using delta-rule memory updates.
parameters: {"layers":10,"head_dim":64,"use_short_conv":true}
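The delta-rule memory update can be sketched as below; the function names, scalar per-step gates, and single-head shapes are illustrative assumptions, and the real layers add details (short convolutions, per-head gating networks) not shown here:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule memory (hedged sketch).

    S: (d_v, d_k) memory matrix; k: (d_k,) key; v: (d_v,) value;
    alpha: decay gate in (0, 1]; beta: write strength.
    Implements S_t = alpha*S - beta*(alpha*S k - v) k^T.
    """
    S = alpha * S                       # gated decay of the old memory
    err = S @ k - v                     # delta rule: prediction error for this key
    return S - beta * np.outer(err, k)  # correct the memory toward v

def gated_delta_read(S, q):
    return S @ q                        # readout is linear in the query
```

With alpha = beta = 1 and a unit-norm key, a single write stores v exactly at that key, which is the defining property of the delta rule versus plain additive (linear-attention) updates.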
SWA
Adds sliding window attention layers for local context modeling.
parameters: {"window":512,"heads":8,"kv_heads":4,"shared_weights":true}
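A single-head sketch of the sliding-window attention pattern; the grouped-KV layout (8 query heads over 4 KV heads) and the shared weights are not shown, and the function name is an assumption:

```python
import numpy as np

def sliding_window_attention(q, k, v, window=512):
    """Causal attention where position i attends only to j in (i-window, i].

    q, k, v: (T, d) single-head tensors.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)            # causal + local band
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```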
BigramHash
Uses bigram hash embeddings as an auxiliary input feature.
parameters: {"shape":"3072x112"}
TrigramHash
Uses trigram hash embeddings as an auxiliary input feature.
parameters: null
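The n-gram hash features can be sketched as follows, using the listed 3072x112 bigram table shape; the multiplicative hash function and initialization are assumptions (the record does not specify them), and the trigram variant is the same idea over three-token windows:

```python
import numpy as np

NUM_BUCKETS, DIM = 3072, 112            # bigram table shape from the record
rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, (NUM_BUCKETS, DIM))

def ngram_bucket(ids, num_buckets=NUM_BUCKETS):
    """Hash an n-gram of token ids into a bucket (hash choice is an assumption)."""
    h = 0
    for t in ids:
        h = (h * 1000003 + int(t)) % (1 << 31)
    return h % num_buckets

def bigram_features(tokens):
    # One hashed embedding per position, keyed by (previous token, token).
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[ngram_bucket(tokens[i - 1:i + 1])]
    return out
```

Hashing lets a fixed-size table cover all possible bigrams at the cost of occasional bucket collisions.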
SmearGate
Applies SmearGate on token embeddings.
parameters: null
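The record leaves SmearGate undocumented. One plausible interpretation (an assumption, not the author's confirmed design) is a learned sigmoid gate that smears each previous token's embedding into the current one:

```python
import numpy as np

def smear(x, gate_logit=0.0):
    """Hedged guess at a smear gate: blend each token embedding with the
    previous one through a sigmoid gate (interpretation is an assumption).

    x: (T, d) token embeddings; gate_logit: learned scalar logit.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))   # sigmoid gate in (0, 1)
    out = x.copy()
    out[1:] = x[1:] + g * x[:-1]            # smear the previous embedding forward
    return out
```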
weight tying
Shares weights across the two SWA layers.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
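Logit softcapping with value 30 bounds the logits smoothly while staying near-linear for small values; a minimal sketch (function name is illustrative):

```python
import numpy as np

def logit_softcap(logits, cap=30.0):
    """Smoothly bound logits to [-cap, cap]; approximately the identity
    for |logits| << cap, saturating via tanh for large magnitudes."""
    return cap * np.tanh(logits / cap)
```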
Quantization
GPTQ
bits: 6
scope: all
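GPTQ chooses roundings with Hessian information so that layer outputs, not individual weights, are preserved; that part is not reproduced here. The sketch below shows only the symmetric 6-bit grid (integer levels -32..31) being rounded onto, using plain round-to-nearest as a simplification:

```python
import numpy as np

def quantize_int6_rtn(W):
    """Round-to-nearest onto a per-row symmetric 6-bit grid.

    Simplification: real GPTQ picks roundings via second-order (Hessian)
    error compensation rather than independent round-to-nearest.
    """
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    q = np.clip(np.round(W / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```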
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
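The EMA of the weights with decay 0.997 amounts to the standard update below (the dict-of-arrays layout is illustrative):

```python
def ema_update(ema, params, decay=0.997):
    """In-place exponential moving average of parameters:
    ema <- decay * ema + (1 - decay) * params."""
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p
    return ema
```

Evaluating the EMA weights rather than the raw final weights typically smooths out late-training noise.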
Evaluation
sliding window eval
parameters: {"window":512,"stride":64}
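With window 512 and stride 64, each evaluation step scores 64 new tokens while conditioning on up to 512 tokens of context. A sketch of the span bookkeeping (the helper is an assumption, not the record's code):

```python
def eval_spans(n_tokens, window=512, stride=64):
    """Spans for sliding-window evaluation. Each tuple is
    (ctx_start, end, score_start): the model sees tokens [ctx_start, end),
    but loss is computed only on [score_start, end), so every token is
    scored exactly once with at most `window` tokens of context."""
    spans, score_start = [], 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, score_start))
        score_start = end
    return spans
```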
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
cosine decay
parameters: {"warmup_steps":100}
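A minimal sketch of cosine decay with 100 linear warmup steps; the decay-to-zero endpoint and the function name are assumptions:

```python
import math

def lr_at(step, base_lr, max_steps, warmup_steps=100):
    """Linear warmup for `warmup_steps`, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```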
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"adamw_for":"embeddings/scalars"}
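Muon's core step replaces each 2D gradient with an approximately orthogonalized version via a quintic Newton-Schulz iteration (5 steps here, with AdamW handling embeddings and scalars). A sketch of the orthogonalization, using the coefficients from the public Muon implementation:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds spectral norm by 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:                        # keep X @ X.T the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X
```

The iteration only needs matrix multiplies, which is why it is practical inside an optimizer step.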
Novel Contributions
- First non-transformer architecture in the 10-minute record track
- Hybrid model combining Gated DeltaNet recurrence with sliding window attention
- Weight-shared SWA layers
- Full-Hessian int6 GPTQ quantization with zstd compression
- Cold-cache 3-seed mean record of 1.028308 val_bpb