PR #1553

open

Non-record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.209735

by Abhishek8108
val_bpb
1.2097
Architecture
Hybrid
Optimizer
Muon
Artifact Size
14.48–14.70 MB

Training Techniques

Architecture
Gated DeltaNet
Replaces most of the transformer attention layers with a gated delta-rule linear-recurrence memory module.
parameters: {"layers":10,"head_dim":64,"expand_v":1,"use_short_conv":true}
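The gated delta rule at the core of this module can be sketched as follows. This is a hypothetical toy version (pure Python, tiny state), not the PR's kernel: the state S is a head_dim × head_dim matrix updated per token by erasing the old association along the key, decaying by a gate, and writing the new key-value pair.

```python
# Toy gated delta-rule recurrence: S <- alpha * (S - beta * (S k) k^T) + beta * v k^T,
# output o = S q. alpha is the decay gate, beta the write strength (both assumed
# scalar here for simplicity; real implementations make them per-token/per-head).

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gated_delta_step(S, k, v, q, alpha, beta):
    Sk = matvec(S, k)  # current memory content read along k
    S = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
          for j in range(len(k))] for i in range(len(k))]
    return S, matvec(S, q)

d = 4
S = [[0.0] * d for _ in range(d)]
k = [1.0, 0.0, 0.0, 0.0]
v = [0.0, 2.0, 0.0, 0.0]
S, o = gated_delta_step(S, k, v, q=k, alpha=0.9, beta=1.0)
print(o)  # querying with q == k retrieves the stored value v
```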
SWA
Uses sliding window attention layers for local context.
parameters: {"layers":2,"window":512,"heads":8,"kv_heads":4,"shared_weights":true}
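The local-context restriction above amounts to a banded causal mask. A minimal sketch (tiny window for illustration; the PR uses window=512):

```python
# Sliding-window causal mask: position i may attend to positions j with
# 0 <= i - j < window, i.e. itself and the (window - 1) previous tokens.

def swa_mask(seq_len, window):
    return [[(0 <= i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

m = swa_mask(5, 3)
visible = [j for j, ok in enumerate(m[4]) if ok]  # positions row 4 can see
```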
BigramHash
Adds bigram hash embeddings to the token representation.
parameters: {"dimensions":3072,"hash_size":112}
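A hedged sketch of the idea: each (previous, current) token pair is hashed into a fixed-size embedding table and the looked-up row is added to the token representation. The hash function, table size, and BOS handling below are illustrative assumptions, not taken from the PR.

```python
# Bigram hash embeddings: hash each adjacent token pair into a bucket and
# use that bucket's embedding as an extra additive signal.

def bigram_bucket(prev_tok, cur_tok, num_buckets):
    # simple multiplicative pair hash; the PR's hash may differ
    return (prev_tok * 1000003 + cur_tok) % num_buckets

def add_bigram_embeddings(tokens, table):
    out = []
    prev = 0  # assume a BOS id of 0 before the first token
    for t in tokens:
        out.append(table[bigram_bucket(prev, t, len(table))])
        prev = t
    return out

emb = add_bigram_embeddings([3, 5], [[float(i)] for i in range(8)])
```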
TrigramHash
Adds trigram hash embeddings to the token representation.
parameters: null
SmearGate
Applies a smear gate on token embeddings.
parameters: null
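One common form of a smear gate, sketched here as an assumption since the PR lists no parameters: each token embedding absorbs a sigmoid-gated fraction of the previous token's embedding.

```python
import math

# Smear gate sketch (assumed form): x[t] <- x[t] + sigmoid(gate) * x[t-1].
# A single scalar gate is used here; a learned per-channel gate is also common.

def smear(embs, gate_logit):
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # gate in (0, 1)
    out = [embs[0][:]]                        # first token has no predecessor
    for prev, cur in zip(embs, embs[1:]):
        out.append([c + g * p for c, p in zip(cur, prev)])
    return out

mixed = smear([[1.0], [0.0]], gate_logit=0.0)
```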
Gated Attention
Uses learnable per-head QK gain scaling in attention.
parameters: {"qk_gain":5}
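The per-head QK gain can be read as a learnable scalar multiplying the attention logits; the exact parameterization in the PR may differ, so treat this as a sketch.

```python
import math

# Per-head gain on the scaled dot-product logit: gain_h * (q . k) / sqrt(d).

def scaled_logit(q, k, head_gain):
    dot = sum(a * b for a, b in zip(q, k))
    return head_gain * dot / math.sqrt(len(q))

logit = scaled_logit([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], head_gain=5.0)
```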
Regularization
logit softcap
parameters: {"value":30}
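Logit softcapping with the listed value of 30 squashes logits into (-30, 30) while staying near-identity for small values:

```python
import math

# Softcap: cap * tanh(logit / cap). Near-linear for |logit| << cap,
# saturates at +/- cap for extreme logits.

def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)
```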
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"adamw_for":"embeddings/scalars"}
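The Newton-Schulz iteration at Muon's core orthogonalizes the (pre-normalized) gradient matrix in a few matmul-only steps. Muon's production version uses a tuned quintic polynomial; the cubic X ← 1.5X − 0.5XXᵀX below is a simplified stand-in for illustration.

```python
# Simplified Newton-Schulz orthogonalization (cubic variant). A diagonal
# matrix with singular values in (0, sqrt(3)) converges to the identity.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz(X, steps=5):
    for _ in range(steps):
        Xt = [list(r) for r in zip(*X)]
        XXtX = matmul(matmul(X, Xt), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

G = [[0.5, 0.0], [0.0, 1.2]]  # stand-in "gradient" with unequal singular values
O = newton_schulz(G)          # both singular values driven toward 1
```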
Weight Averaging
EMA
parameters: {"decay":0.997,"applied_at_end":true}
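With the listed decay of 0.997, the EMA shadow copy tracks the live weights with an effective horizon of a few hundred steps and is swapped in at the end of training:

```python
# EMA weight averaging: shadow <- decay * shadow + (1 - decay) * weights.

def ema_update(shadow, weights, decay=0.997):
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0]
for _ in range(1000):                 # live weight fixed at 1.0 for illustration
    shadow = ema_update(shadow, [1.0])
# after 1000 steps the shadow has covered 1 - 0.997**1000 (about 95%) of the gap
```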
Quantization
GPTQ
bits: 6
scope: linear layers
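For intuition about the 6-bit storage format, here is a plain round-to-nearest symmetric quantizer. This is only an illustration of the bit budget; it omits GPTQ's Hessian-based, column-by-column error compensation entirely.

```python
# Symmetric 6-bit quantization sketch: integers in [-31, 31] plus a
# per-tensor float scale (per-channel scales are more typical in practice).

def quantize_6bit(weights):
    scale = max(abs(w) for w in weights) / 31.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

q, s = quantize_6bit([0.31, -1.0, 0.5])
```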
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":512,"window":512}
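Sliding-window evaluation scores each token exactly once: the model sees up to `window` tokens of context, and only the final `stride` tokens of each step are scored. With stride == window, as listed here, the scored chunks do not overlap. A sketch of the span bookkeeping:

```python
# Produce (context_start, score_start, end) spans: tokens in
# [score_start, end) are scored with context from context_start.

def eval_spans(n_tokens, window=512, stride=512):
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), pos, end))
        pos = end
    return spans

spans = eval_spans(5, window=4, stride=2)
```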
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
LR Schedule
cosine decay
parameters: {"warmup_steps":100,"constant_after_warmup":true}
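One reading of the listed schedule (warmup_steps=100, constant_after_warmup, cosine decay) is a warmup-stable-decay shape: linear warmup, a constant plateau, then a cosine tail. The plateau length and decay horizon below are assumptions, not stated in the PR.

```python
import math

# Linear warmup -> constant plateau -> cosine decay to zero.

def lr_at(step, peak_lr, warmup=100, decay_start=800, total=1000):
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / (total - decay_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```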

Novel Contributions

  • GDN-Hybrid architecture combining Gated DeltaNet with Sliding Window Attention
  • Correction of a BPB evaluation bug caused by double-counting leading-space bytes
  • Post-hoc rescoring of three saved artifacts with the corrected BPB formula
  • Use of shared-weight SWA layers in a Griffin-style hybrid layout
  • GPTQ calibration on model-generated synthetic sequences only
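The corrected BPB metric mentioned above divides total cross-entropy by the number of UTF-8 bytes, converted from nats to bits; the fix was to count each byte (including leading-space bytes) exactly once. A minimal sketch of the formula, with the byte count taken directly from the raw text:

```python
import math

# bits-per-byte = total NLL (nats) / (n_bytes * ln 2).
# Counting bytes from the raw text once avoids the double-counting bug
# the PR describes for leading-space bytes.

def bits_per_byte(total_nll_nats, text):
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (n_bytes * math.log(2))

bpb = bits_per_byte(math.log(2) * 8, "abcdefgh")  # 8 bits of NLL over 8 bytes
```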