PR #1563
closed
Record: GDN-Hybrid (Gated DeltaNet + Sliding Window Attention)
by joshkmartinez
val_bpb
1.0205
Architecture
Hybrid
Optimizer
AdamW
Artifact Size
15.31–15.83 MB
Training Techniques
Architecture
Gated DeltaNet
Hybrid backbone built around a Gated DeltaNet-style model.
parameters: null
XSA
Sliding-window attention side path / shared attention branch in the hybrid architecture.
parameters: null
SP1024 tokenizer
Uses an SP1024-tokenized input representation.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997}
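As a rough illustration of what an EMA with decay 0.997 does, the sketch below keeps a running average of a parameter dictionary. The function name `ema_update` and the dict-of-floats layout are assumptions made for this example; the submission's actual training code is not shown in this summary.

```python
def ema_update(ema, params, decay=0.997):
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema


# Toy example: one scalar "weight" whose trained value sits at 1.0
# while the average starts from 0.0.
ema = {"w": 0.0}
for _ in range(1000):
    ema_update(ema, {"w": 1.0})

# After n updates the average equals 1 - decay**n of the target value.
print(ema["w"])
```

With a decay of 0.997, each new step contributes 0.3% of the average, so the averaged weights mostly reflect the last several hundred updates.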
Quantization
late QAT
bits: null
scope: null
GPTQ
bits: 6
scope: model weights
Compression
zstd
level: 22
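To make the 6-bit storage format concrete, here is a minimal sketch of quantizing weights to signed 6-bit codes and packing four codes into three bytes. It uses plain round-to-nearest with a single per-tensor scale, which is a stand-in and not GPTQ itself (GPTQ additionally compensates quantization error layer by layer). All function names are hypothetical. The packed bytes would then be handed to a compressor such as zstd at level 22, which is omitted here because it needs a third-party package.

```python
def quantize_int6(weights):
    """Symmetric per-tensor round-to-nearest quantization to [-31, 31]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 31.0
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale


def pack_int6(codes):
    """Pack signed 6-bit codes into bytes, four codes per three bytes."""
    padded = list(codes) + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        word = 0
        for c in padded[i:i + 4]:
            word = (word << 6) | (c & 0x3F)  # keep the low 6 two's-complement bits
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_int6(data, count):
    """Inverse of pack_int6; returns the first `count` signed codes."""
    codes = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            u = (word >> shift) & 0x3F
            codes.append(u - 64 if u >= 32 else u)  # restore the sign
    return codes[:count]


weights = [0.5, -1.0, 0.25, 0.0, 0.75]
codes, scale = quantize_int6(weights)
packed = pack_int6(codes)
restored = [c * scale for c in unpack_int6(packed, len(codes))]
```

Packing at 6 bits per weight cuts the raw payload to 75% of an int8 layout before any entropy coding is applied.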
Evaluation
sliding window eval
parameters: null
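The entry lists no parameters for the sliding-window evaluation, so the following is only a generic sketch of the idea: the model is run over overlapping windows and each token is scored once, with earlier tokens in the window acting as context. The `window` and `stride` arguments and the function name are assumptions for illustration.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) spans covering n_tokens.

    Tokens in [score_from, end) are scored in this window; tokens in
    [start, score_from) are fed to the model only as context.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    scored_up_to = 0
    while scored_up_to < n_tokens:
        if scored_up_to == 0:
            end = min(window, n_tokens)  # first window scores everything it sees
        else:
            end = min(scored_up_to + stride, n_tokens)
        start = max(0, end - window)
        yield start, end, scored_up_to
        scored_up_to = end


spans = list(sliding_windows(n_tokens=10, window=4, stride=2))
```

A smaller stride gives each scored token more left context at the cost of more forward passes.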
Novel Contributions
- GDN-Hybrid backbone combining Gated DeltaNet with sliding-window attention
- SP1024 tokenizer
- MuonEq-R + AdamW training mix
- Late QAT thresholding
- GPTQ int6 packaging with zstd-22 compression
- Three-seed record-track submission with reported mean val_bpb
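For readers unfamiliar with the metric, val_bpb is usually read as validation bits per byte: the summed negative log-likelihood converted from nats to bits and divided by the number of raw bytes in the validation text. The sketch below shows that conversion and a mean over seeds. The helper names are hypothetical, and no per-seed numbers from the submission are reproduced here.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


def mean_over_seeds(per_seed_bpb):
    """Arithmetic mean of the per-seed bits-per-byte values."""
    return sum(per_seed_bpb) / len(per_seed_bpb)


# Sanity check: a model that spends exactly 1 bit on each of 8 bytes.
one_bit_per_byte = bits_per_byte(total_nll_nats=8 * math.log(2), total_bytes=8)
```

Normalizing by bytes instead of tokens keeps scores comparable across tokenizers with different vocabulary sizes.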