PR #1632
Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.0274 (2-seed mean)
by Hkoyuer
val_bpb: 1.0274
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.6-14.8 MB
Training Techniques
Architecture
Gated DeltaNet
Recurrent key-value associative memory updated by the delta rule.
parameters: {"layers":10}
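The delta-rule memory update can be sketched as follows. This is a minimal single-step reference, assuming a scalar forget gate and write strength per step and an L2-normalized key; the PR's actual kernel (chunked, batched, learned gates) is not shown here.

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One recurrence step of a gated delta-rule associative memory.

    S:     (d_v, d_k) key-value state matrix
    q, k:  (d_k,) query/key vectors (k assumed L2-normalized)
    v:     (d_v,) value vector
    alpha: scalar forget gate in (0, 1)
    beta:  scalar write strength in (0, 1)
    """
    # Delta rule: decay the state, erase the value currently bound to k,
    # then write the new key-value association.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q  # new state and read-out o_t = S_t q_t
```

With an empty state and alpha = beta = 1, reading with the same key returns exactly the value just written, which is the defining fixed point of the delta rule.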
SWA
Sliding-window attention layers, with weights shared across both SWA layers.
parameters: {"layers":2,"window":512,"heads":8,"kv_heads":4}
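A naive single-head reference for causal sliding-window attention with the window=512 semantics above; the actual implementation would be batched, multi-head, and fused, and the exact banding convention is an assumption:

```python
import numpy as np

def sliding_window_attention(q, k, v, window=512):
    """Single-head causal sliding-window attention (naive reference).

    q, k: (T, d); v: (T, d_v). Position t attends only to positions
    max(0, t - window + 1) .. t.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)   # causal band of width `window`
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With window=1 each position attends only to itself, so the output reduces to v, which is a handy sanity check for the masking.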
BigramHash
Bigram hash embeddings used in the input stack.
parameters: {"dimensions":[3072,112]}
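Hashed n-gram embeddings avoid materializing a vocab² table by hashing each (prev, cur) token pair into a fixed number of buckets. The sketch below assumes a (3072, 112) table per the parameters above and a simple multiplicative hash; the PR's actual hash function and padding convention are not specified. The trigram variant is analogous, hashing a 3-token context instead.

```python
import numpy as np

def bigram_hash_embed(tokens, table, mult=2654435761):
    """Look up hashed bigram embeddings (sketch; hashing scheme assumed).

    tokens: (T,) int token ids
    table:  (n_buckets, dim) embedding table, e.g. (3072, 112)
    """
    n_buckets = table.shape[0]
    prev = np.concatenate([[0], tokens[:-1]])  # pad position 0 with id 0
    h = (prev * mult + tokens) % n_buckets     # cheap multiplicative hash
    return table[h]                            # (T, dim)
```

Identical bigrams always hash to the same row, so repeated pairs share an embedding (and unrelated pairs may collide, which training absorbs).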
TrigramHash
Trigram hash embeddings used in the input stack.
parameters: null
SmearGate
Gate applied on token embeddings.
parameters: null
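One plausible form of an embedding smear gate, where each position mixes in a gated fraction of the previous token's embedding; the exact gating form (scalar vs. per-channel, additive vs. convex) is an assumption here:

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Smearing gate on token embeddings (sketch; exact form assumed).

    x: (T, d) token embeddings. Each position adds a sigmoid-gated copy
    of the previous position's embedding: x_t <- x_t + sigmoid(g) * x_{t-1}.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return x + g * prev
```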
GQA
Grouped-query attention: 8 query heads share 4 key/value heads, halving KV projection and cache size.
parameters: {"heads":8,"kv_heads":4}
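The KV-head sharing reduces to a repeat along the head axis at attention time: with 8 query heads and 4 KV heads, each KV head serves 2 query heads. A minimal sketch of that expansion:

```python
import numpy as np

def expand_kv_heads(kv, n_q_heads):
    """Expand grouped KV heads to match query heads (GQA sketch).

    kv: (kv_heads, T, d). Each KV head is repeated n_q_heads // kv_heads
    times along the head axis (8 // 4 = 2 for this record).
    """
    kv_heads = kv.shape[0]
    return np.repeat(kv, n_q_heads // kv_heads, axis=0)  # (n_q_heads, T, d)
```

Production kernels usually avoid materializing the repeat and index the shared heads directly, but the semantics are the same.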
Regularization
logit softcap
parameters: {"value":30}
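Logit softcapping with value 30 is the standard tanh squashing, which smoothly bounds logits to (-30, 30) while staying near-identity for small values:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap) via tanh; ~identity for |x| << cap."""
    return cap * np.tanh(logits / cap)
```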
Weight Averaging
EMA
parameters: {"decay":0.997}
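The EMA of the weights with decay 0.997 is a per-step interpolation toward the current parameters; the averaged copy, not the raw weights, is what gets evaluated and shipped:

```python
def ema_update(ema_params, params, decay=0.997):
    """One exponential-moving-average step over model weights (sketch)."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}
```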
SWA (stochastic weight averaging)
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"adamw_for":["embeddings","scalars"]}
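Muon post-processes each matrix's momentum update with a few Newton-Schulz iterations that approximately orthogonalize it (drive all singular values toward 1), with AdamW handling embeddings and scalars as listed above. A sketch of the 5-step iteration, using the quintic coefficients from the public Muon implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix via Newton-Schulz iteration,
    as Muon does to momentum updates (coefficients from the public
    Muon implementation; this reference version is unbatched).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # Frobenius norm bounds spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After 5 steps the singular values land near 1 (the iteration oscillates in a band around it rather than converging exactly), which is sufficient for the optimizer's purposes.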
Quantization
GPTQ
bits: 6
scope: matrices
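For the int6 storage format, a per-row symmetric quantizer looks like the sketch below. This illustrates only the bit layout and scales; actual GPTQ additionally uses second-order (Hessian-based) error compensation when rounding, which is omitted here.

```python
import numpy as np

def quantize_int6(W):
    """Per-row symmetric 6-bit quantization (storage-format sketch only;
    real GPTQ also applies Hessian-based rounding correction).
    Symmetric int6 range used here: [-31, 31]."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(W / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes and row scales."""
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-element error by half a quantization step (scale / 2), which is what the test below checks.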
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"causal":true}
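One common causal sliding-window evaluation protocol strides a fixed window across the token stream and scores only the tokens not covered by the previous span, so every token is scored exactly once with long left context; the exact windowing used in this record is assumed, not specified.

```python
def eval_spans(n_tokens, window=2048, stride=512):
    """Spans (begin, end, score_from) for strided causal evaluation.

    Each span covers up to `window` tokens; only tokens from `score_from`
    to `end` contribute to the loss, so every token is scored exactly once
    while inheriting up to `window - stride` tokens of left context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```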
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Independent reproduction of GDN-Hybrid with a 2-seed cold-cache mean val_bpb of 1.0274
- GDN-Hybrid architecture combining Gated DeltaNet blocks with shared-weight SWA layers
- Full-Hessian GPTQ quantization with int6 matrices and zstd compression under the 16MB limit
- Cold-cache training protocol on fresh 8xH100 SXM pods with verified Triton JIT overhead
- ECN research claiming post-training logit-correction improvements at zero artifact-size cost