PR #1553

open

Non-record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.209735

by Abhishek8108
val_bpb
1.2097
Architecture
Hybrid
Optimizer
Muon
Artifact Size
14.48–14.70 MB

Training Techniques

Architecture
Gated DeltaNet
Replaces most of the transformer attention layers with a gated delta-rule linear-recurrence memory module.
parameters: {"layers":10,"head_dim":64,"expand_v":1,"use_short_conv":true}
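The gated delta rule at the core of this module can be sketched as follows. This is a hypothetical toy version (pure Python, tiny state), not the PR's kernel: the state S is a head_dim × head_dim matrix updated per token by erasing the old association along the key, decaying by a gate, and writing the new key-value pair.

```python
# Toy gated delta-rule recurrence: S <- alpha * (S - beta * (S k) k^T) + beta * v k^T,
# output o = S q. alpha is the decay gate, beta the write strength (both assumed
# scalar here for simplicity; real implementations make them per-token/per-head).

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gated_delta_step(S, k, v, q, alpha, beta):
    Sk = matvec(S, k)  # current memory content read along k
    S = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
          for j in range(len(k))] for i in range(len(k))]
    return S, matvec(S, q)

d = 4
S = [[0.0] * d for _ in range(d)]
k = [1.0, 0.0, 0.0, 0.0]
v = [0.0, 2.0, 0.0, 0.0]
S, o = gated_delta_step(S, k, v, q=k, alpha=0.9, beta=1.0)
print(o)  # querying with q == k retrieves the stored value v
```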
SWA
Uses sliding window attention layers for local context.
parameters: {"layers":2,"window":512,"heads":8,"kv_heads":4,"shared_weights":true}
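The local-context restriction above amounts to a banded causal mask. A minimal sketch (tiny window for illustration; the PR uses window=512):

```python
# Sliding-window causal mask: position i may attend to positions j with
# 0 <= i - j < window, i.e. itself and the (window - 1) previous tokens.

def swa_mask(seq_len, window):
    return [[(0 <= i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

m = swa_mask(5, 3)
visible = [j for j, ok in enumerate(m[4]) if ok]  # positions row 4 can see
```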
BigramHash
Adds bigram hash embeddings to the token representation.
parameters: {"dimensions":3072,"hash_size":112}
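A hedged sketch of the idea: each (previous, current) token pair is hashed into a fixed-size embedding table and the looked-up row is added to the token representation. The hash function, table size, and BOS handling below are illustrative assumptions, not taken from the PR.

```python
# Bigram hash embeddings: hash each adjacent token pair into a bucket and
# use that bucket's embedding as an extra additive signal.

def bigram_bucket(prev_tok, cur_tok, num_buckets):
    # simple multiplicative pair hash; the PR's hash may differ
    return (prev_tok * 1000003 + cur_tok) % num_buckets

def add_bigram_embeddings(tokens, table):
    out = []
    prev = 0  # assume a BOS id of 0 before the first token
    for t in tokens:
        out.append(table[bigram_bucket(prev, t, len(table))])
        prev = t
    return out

emb = add_bigram_embeddings([3, 5], [[float(i)] for i in range(8)])
```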
TrigramHash
Adds trigram hash embeddings to the token representation.
parameters: null
SmearGate
Applies a smear gate on token embeddings.
parameters: null
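One common form of a smear gate, sketched here as an assumption since the PR lists no parameters: each token embedding absorbs a sigmoid-gated fraction of the previous token's embedding.

```python
import math

# Smear gate sketch (assumed form): x[t] <- x[t] + sigmoid(gate) * x[t-1].
# A single scalar gate is used here; a learned per-channel gate is also common.

def smear(embs, gate_logit):
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # gate in (0, 1)
    out = [embs[0][:]]                        # first token has no predecessor
    for prev, cur in zip(embs, embs[1:]):
        out.append([c + g * p for c, p in zip(cur, prev)])
    return out

mixed = smear([[1.0], [0.0]], gate_logit=0.0)
```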
Gated Attention
Uses learnable per-head QK gain scaling in attention.
parameters: {"qk_gain":5}
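The per-head QK gain can be read as a learnable scalar multiplying the attention logits; the exact parameterization in the PR may differ, so treat this as a sketch.

```python
import math

# Per-head gain on the scaled dot-product logit: gain_h * (q . k) / sqrt(d).

def scaled_logit(q, k, head_gain):
    dot = sum(a * b for a, b in zip(q, k))
    return head_gain * dot / math.sqrt(len(q))

logit = scaled_logit([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], head_gain=5.0)
```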
Regularization
logit softcap
parameters: {"value":30}
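Logit softcapping with the listed value of 30 squashes logits into (-30, 30) while staying near-identity for small values:

```python
import math

# Softcap: cap * tanh(logit / cap). Near-linear for |logit| << cap,
# saturates at +/- cap for extreme logits.

def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)
```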
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"adamw_for":"embeddings/scalars"}
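The Newton-Schulz iteration at Muon's core orthogonalizes the (pre-normalized) gradient matrix in a few matmul-only steps. Muon's production version uses a tuned quintic polynomial; the cubic X ← 1.5X − 0.5XXᵀX below is a simplified stand-in for illustration.

```python
# Simplified Newton-Schulz orthogonalization (cubic variant). A diagonal
# matrix with singular values in (0, sqrt(3)) converges to the identity.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz(X, steps=5):
    for _ in range(steps):
        Xt = [list(r) for r in zip(*X)]
        XXtX = matmul(matmul(X, Xt), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

G = [[0.5, 0.0], [0.0, 1.2]]  # stand-in "gradient" with unequal singular values
O = newton_schulz(G)          # both singular values driven toward 1
```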
Weight Averaging
EMA
parameters: {"decay":0.997,"applied_at_end":true}
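With the listed decay of 0.997, the EMA shadow copy tracks the live weights with an effective horizon of a few hundred steps and is swapped in at the end of training:

```python
# EMA weight averaging: shadow <- decay * shadow + (1 - decay) * weights.

def ema_update(shadow, weights, decay=0.997):
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for _ in range(1000):                 # live weight fixed at 1.0 for illustration
    shadow = ema_update(shadow, [1.0])
# after 1000 steps the shadow has covered 1 - 0.997**1000 (about 95%) of the gap
```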
Quantization
GPTQ
bits: 6
scope: linear layers
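For intuition about the 6-bit storage format, here is a plain round-to-nearest symmetric quantizer. This is only an illustration of the bit budget; it omits GPTQ's Hessian-based, column-by-column error compensation entirely.

```python
# Symmetric 6-bit quantization sketch: integers in [-31, 31] plus a
# per-tensor float scale (per-channel scales are more typical in practice).

def quantize_6bit(weights):
    scale = max(abs(w) for w in weights) / 31.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

q, s = quantize_6bit([0.31, -1.0, 0.5])
```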
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":512,"window":512}
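Sliding-window evaluation scores each token exactly once: the model sees up to `window` tokens of context, and only the final `stride` tokens of each step are scored. With stride == window, as listed here, the scored chunks do not overlap. A sketch of the span bookkeeping:

```python
# Produce (context_start, score_start, end) spans: tokens in
# [score_start, end) are scored with context from context_start.

def eval_spans(n_tokens, window=512, stride=512):
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), pos, end))
        pos = end
    return spans

spans = eval_spans(5, window=4, stride=2)
```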
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
cosine decay
parameters: {"warmup_steps":100,"constant_after_warmup":true}
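One reading of the listed schedule (warmup_steps=100, constant_after_warmup, cosine decay) is a warmup-stable-decay shape: linear warmup, a constant plateau, then a cosine tail. The plateau length and decay horizon below are assumptions, not stated in the PR.

```python
import math

# Linear warmup -> constant plateau -> cosine decay to zero.

def lr_at(step, peak_lr, warmup=100, decay_start=800, total=1000):
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / (total - decay_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```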

Novel Contributions

  • GDN-Hybrid architecture combining Gated DeltaNet with Sliding Window Attention
  • Correction of a BPB evaluation bug caused by double-counting leading-space bytes
  • Post-hoc rescoring of three saved artifacts with the corrected BPB formula
  • Use of shared-weight SWA layers in a Griffin-style hybrid layout
  • GPTQ calibration on model-generated synthetic sequences only
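The corrected BPB metric mentioned above divides total cross-entropy by the number of UTF-8 bytes, converted from nats to bits; the fix was to count each byte (including leading-space bytes) exactly once. A minimal sketch of the formula, with the byte count taken directly from the raw text:

```python
import math

# bits-per-byte = total NLL (nats) / (n_bytes * ln 2).
# Counting bytes from the raw text once avoids the double-counting bug
# the PR describes for leading-space bytes.

def bits_per_byte(total_nll_nats, text):
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (n_bytes * math.log(2))

bpb = bits_per_byte(math.log(2) * 8, "abcdefgh")  # 8 bits of NLL over 8 bytes
```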