PR #1563

closed

Record: GDN-Hybrid (Gated DeltaNet + Sliding Window Attention)

by joshkmartinez
val_bpb: 1.0205
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15.31–15.83 MB

Training Techniques

Architecture
Gated DeltaNet
Hybrid backbone built around a Gated DeltaNet-style model.
parameters: null
XSA
Sliding-window attention side path / shared attention branch in the hybrid architecture.
parameters: null
SP1024 tokenizer
Uses an SP1024-tokenized input representation.
parameters: null
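
For orientation, the recurrence behind a Gated DeltaNet-style layer can be sketched in a few lines. This is a minimal, unoptimized illustration of the gated delta rule as it is commonly written, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, and not the code from this PR. The function names, the scalar gates `alpha` and `beta`, and the plain nested-list state are all assumptions made for the example.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One step of the gated delta rule (illustrative, not the PR's kernel).

    S     : d_v x d_k state matrix as nested lists
    k, v  : key and value vectors for the current token
    alpha : decay gate in [0, 1] applied to the old state
    beta  : write strength in [0, 1]
    """
    # S @ k is what the state currently returns for this key.
    Sk = [sum(row[j] * k[j] for j in range(len(k))) for row in S]
    # Erase the old association for k (delta rule), decay, then write v.
    return [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, q):
    """Query the state: o = S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]
```

With a unit-norm key, `alpha = 1` and `beta = 1`, writing a second value under the same key replaces the first one rather than adding to it, which is the property that distinguishes the delta rule from plain linear attention.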
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997}
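
As a rough illustration of what an EMA with decay 0.997 does, the sketch below keeps a shadow copy of the parameters and blends in the live values after each step. The dictionary-of-floats representation and the name `ema_update` are assumptions for the example; the PR's actual update schedule is not given in this summary.

```python
def ema_update(ema, params, decay=0.997):
    """Blend live parameters into the EMA copy in place and return it.

    ema, params : dicts mapping parameter name -> float (hypothetical layout)
    decay       : fraction of the old average kept at each step
    """
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

At decay 0.997 each update moves the average 0.3% of the way toward the current weights, so the averaged model reflects roughly the last few hundred steps.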
Quantization
late QAT
bits: null
scope: null
GPTQ
bits: 6
scope: model weights
Compression
zstd
level: 22
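
To make the 6-bit packaging concrete, here is a simplified sketch of storing weights as signed 6-bit integers, four values per three bytes, and then compressing the result. It makes two substitutions that should be read as assumptions: plain round-to-nearest quantization with a single scale stands in for GPTQ (which additionally compensates rounding error using second-order statistics), and `zlib` from the standard library stands in for zstd so the example has no third-party dependency. zlib level 9 is not equivalent to zstd level 22. All function names are hypothetical.

```python
import zlib


def quantize_int6(weights):
    """Symmetric round-to-nearest onto integers in [-31, 31] with one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale


def pack_int6(q):
    """Pack signed 6-bit values, four per three bytes (offset by 32)."""
    out = bytearray()
    padded = list(q) + [0] * (-len(q) % 4)  # pad to a multiple of four
    for i in range(0, len(padded), 4):
        a, b, c, d = (x + 32 for x in padded[i:i + 4])
        word = (a << 18) | (b << 12) | (c << 6) | d
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_int6(data, count):
    """Inverse of pack_int6; `count` drops the padding values."""
    q = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        q.extend(((word >> s) & 0x3F) - 32 for s in (18, 12, 6, 0))
    return q[:count]


def compress(packed):
    """Stand-in for the zstd stage, using zlib at its highest level."""
    return zlib.compress(packed, 9)
```

Packing at 6 bits stores each weight in 0.75 bytes before entropy coding; the compression stage then removes whatever redundancy remains in the packed stream.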
Evaluation
sliding window eval
parameters: null
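
Sliding-window evaluation generally means scoring a long token stream with overlapping context windows so that every token is scored exactly once while most tokens still see a substantial amount of left context. The sketch below only computes the window boundaries; the `window` and `stride` values are hypothetical, since this summary lists no parameters for the evaluation.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) triples.

    The model is run on tokens[start:end]; only positions
    score_from .. end-1 contribute to the loss, so no token is counted twice.
    """
    assert 0 < stride <= window
    scored_to = 0
    while scored_to < n_tokens:
        if scored_to == 0:
            end = min(window, n_tokens)              # first window scores everything it sees
        else:
            end = min(scored_to + stride, n_tokens)  # later windows score `stride` new tokens
        start = max(0, end - window)
        yield start, end, scored_to
        scored_to = end
```

A smaller stride gives each scored token more context at the cost of more forward passes.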

Novel Contributions

  • GDN-Hybrid backbone combining Gated DeltaNet with sliding-window attention
  • SP1024 tokenizer
  • MuonEq-R + AdamW training mix
  • Late QAT thresholding
  • GPTQ int6 packaging with zstd-22 compression
  • Three-seed record-track submission with reported mean val_bpb
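
For readers unfamiliar with the metric, bits per byte is usually obtained by summing the per-token negative log-likelihood (in nats) over the validation set, converting to bits, and dividing by the number of raw bytes rather than tokens. The sketch below shows that conversion and a plain mean over seeds. The function names are hypothetical and the exact averaging used for the reported figure is an assumption.

```python
import math
from statistics import mean


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


def mean_val_bpb(per_seed_bpb):
    """Arithmetic mean of per-seed val_bpb values."""
    return mean(per_seed_bpb)
```

Normalizing by bytes rather than tokens makes scores comparable across tokenizers with different vocabulary sizes.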