PR #1632
Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.0274 (2-seed mean)
by Hkoyuer
val_bpb: 1.0274
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.6-14.8 MB
Training Techniques
Architecture
Gated DeltaNet
Recurrent key-value associative memory updated by the delta rule.
parameters: {"layers":10}
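The delta-rule memory update can be sketched as follows. This is a minimal single-step reference, assuming a scalar forget gate and write strength per step and an L2-normalized key; the PR's actual kernel (chunked, batched, learned gates) is not shown here.

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One recurrence step of a gated delta-rule associative memory.

    S:     (d_v, d_k) key-value state matrix
    q, k:  (d_k,) query/key vectors (k assumed L2-normalized)
    v:     (d_v,) value vector
    alpha: scalar forget gate in (0, 1)
    beta:  scalar write strength in (0, 1)
    """
    # Delta rule: decay the state, erase the value currently bound to k,
    # then write the new key-value association.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q  # new state and read-out o_t = S_t q_t
```

With an empty state and alpha = beta = 1, reading with the same key returns exactly the value just written, which is the defining fixed point of the delta rule.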
SWA
Sliding-window attention layers, with weights shared across both SWA layers.
parameters: {"layers":2,"window":512,"heads":8,"kv_heads":4}
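A naive single-head reference for causal sliding-window attention with the window=512 semantics above; the actual implementation would be batched, multi-head, and fused, and the exact banding convention is an assumption:

```python
import numpy as np

def sliding_window_attention(q, k, v, window=512):
    """Single-head causal sliding-window attention (naive reference).

    q, k: (T, d); v: (T, d_v). Position t attends only to positions
    max(0, t - window + 1) .. t.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)   # causal band of width `window`
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With window=1 each position attends only to itself, so the output reduces to v, which is a handy sanity check for the masking.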
BigramHash
Bigram hash embeddings used in the input stack.
parameters: {"dimensions":[3072,112]}
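Hashed n-gram embeddings avoid materializing a vocab² table by hashing each (prev, cur) token pair into a fixed number of buckets. The sketch below assumes a (3072, 112) table per the parameters above and a simple multiplicative hash; the PR's actual hash function and padding convention are not specified. The trigram variant is analogous, hashing a 3-token context instead.

```python
import numpy as np

def bigram_hash_embed(tokens, table, mult=2654435761):
    """Look up hashed bigram embeddings (sketch; hashing scheme assumed).

    tokens: (T,) int token ids
    table:  (n_buckets, dim) embedding table, e.g. (3072, 112)
    """
    n_buckets = table.shape[0]
    prev = np.concatenate([[0], tokens[:-1]])  # pad position 0 with id 0
    h = (prev * mult + tokens) % n_buckets     # cheap multiplicative hash
    return table[h]                            # (T, dim)
```

Identical bigrams always hash to the same row, so repeated pairs share an embedding (and unrelated pairs may collide, which training absorbs).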
TrigramHash
Trigram hash embeddings used in the input stack.
parameters: null
SmearGate
Gate applied on token embeddings.
parameters: null
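One plausible form of an embedding smear gate, where each position mixes in a gated fraction of the previous token's embedding; the exact gating form (scalar vs. per-channel, additive vs. convex) is an assumption here:

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Smearing gate on token embeddings (sketch; exact form assumed).

    x: (T, d) token embeddings. Each position adds a sigmoid-gated copy
    of the previous position's embedding: x_t <- x_t + sigmoid(g) * x_{t-1}.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return x + g * prev
```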
GQA
Grouped-query attention: 8 query heads share 4 key/value heads, halving KV projection and cache size.
parameters: {"heads":8,"kv_heads":4}
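The KV-head sharing reduces to a repeat along the head axis at attention time: with 8 query heads and 4 KV heads, each KV head serves 2 query heads. A minimal sketch of that expansion:

```python
import numpy as np

def expand_kv_heads(kv, n_q_heads):
    """Expand grouped KV heads to match query heads (GQA sketch).

    kv: (kv_heads, T, d). Each KV head is repeated n_q_heads // kv_heads
    times along the head axis (8 // 4 = 2 for this record).
    """
    kv_heads = kv.shape[0]
    return np.repeat(kv, n_q_heads // kv_heads, axis=0)  # (n_q_heads, T, d)
```

Production kernels usually avoid materializing the repeat and index the shared heads directly, but the semantics are the same.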
Regularization
logit softcap
parameters: {"value":30}
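Logit softcapping with value 30 is the standard tanh squashing, which smoothly bounds logits to (-30, 30) while staying near-identity for small values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via tanh; ~identity for |x| << cap."""
    return cap * np.tanh(logits / cap)
```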
Weight Averaging
EMA
parameters: {"decay":0.997}
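The EMA of the weights with decay 0.997 is a per-step interpolation toward the current parameters; the averaged copy, not the raw weights, is what gets evaluated and shipped:

```python
def ema_update(ema_params, params, decay=0.997):
    """One exponential-moving-average step over model weights (sketch)."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}
```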
SWA (stochastic weight averaging)
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"adamw_for":["embeddings","scalars"]}
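Muon post-processes each matrix's momentum update with a few Newton-Schulz iterations that approximately orthogonalize it (drive all singular values toward 1), with AdamW handling embeddings and scalars as listed above. A sketch of the 5-step iteration, using the quintic coefficients from the public Muon implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix via Newton-Schulz iteration,
    as Muon does to momentum updates (coefficients from the public
    Muon implementation; this reference version is unbatched).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # Frobenius norm bounds spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After 5 steps the singular values land near 1 (the iteration oscillates in a band around it rather than converging exactly), which is sufficient for the optimizer's purposes.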
Quantization
GPTQ
bits: 6
scope: matrices
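For the int6 storage format, a per-row symmetric quantizer looks like the sketch below. This illustrates only the bit layout and scales; actual GPTQ additionally uses second-order (Hessian-based) error compensation when rounding, which is omitted here.

```python
import numpy as np

def quantize_int6(W):
    """Per-row symmetric 6-bit quantization (storage-format sketch only;
    real GPTQ also applies Hessian-based rounding correction).
    Symmetric int6 range used here: [-31, 31]."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(W / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes and row scales."""
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-element error by half a quantization step (scale / 2), which is what the test below checks.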
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"causal":true}
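One common causal sliding-window evaluation protocol strides a fixed window across the token stream and scores only the tokens not covered by the previous span, so every token is scored exactly once with long left context; the exact windowing used in this record is assumed, not specified.

```python
def eval_spans(n_tokens, window=2048, stride=512):
    """Spans (begin, end, score_from) for strided causal evaluation.

    Each span covers up to `window` tokens; only tokens from `score_from`
    to `end` contribute to the loss, so every token is scored exactly once
    while inheriting up to `window - stride` tokens of left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```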
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Independent reproduction of GDN-Hybrid with a 2-seed cold-cache mean val_bpb of 1.0274
- GDN-Hybrid architecture combining Gated DeltaNet blocks with shared-weight SWA layers
- Full-Hessian GPTQ quantization with int6 matrices and zstd compression under the 16MB limit
- Cold-cache training protocol on fresh 8xH100 SXM pods with verified Triton JIT overhead
- ECN research claiming post-training logit-correction improvements at zero artifact-size cost