PR #1544 (closed)
Record: GDN-Hybrid (Gated DeltaNet + SWA) — val_bpb 1.028308 (3-seed cold-cache mean)
by Abhishek8108
val_bpb: 1.0283
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.48–14.70 MB
Training Techniques
Architecture
GatedDeltaNet
Replaces the transformer backbone with a delta-rule linear recurrence model for long-range memory.
parameters: {"layers":10,"head_dim":64,"use_short_conv":true}
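A minimal sketch of the delta-rule state update at the core of this kind of layer (single head, pure Python; the exact gating parameterization `alpha`/`beta` and this simplified form are illustrative assumptions, not the record's implementation):

```python
def delta_step(S, q, k, v, alpha, beta):
    """One gated delta-rule recurrence step (illustrative simplification).

    S is a d x d state matrix (list of lists). The update
        S <- alpha * S @ (I - beta * k k^T) + beta * v k^T
    erases the old association stored under key k before writing v,
    then the output is a read: o = S @ q.
    """
    d = len(k)
    # S @ (I - beta k k^T) expands to S - beta * (S k) k^T
    Sk = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    S_new = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
              for j in range(d)] for i in range(d)]
    o = [sum(S_new[i][j] * q[j] for j in range(d)) for i in range(d)]
    return S_new, o
```

With a unit-norm key and `beta = 1`, reading back with `q = k` retrieves the stored value exactly, which is the associative-memory behavior the delta rule is used for.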
SWA
Adds sliding window attention layers for local context.
parameters: {"window":512,"heads":8,"kv_heads":4,"shared":true}
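The attention pattern this adds is a causal band: position `i` attends only to positions `j` with `i - window < j <= i`. A minimal mask sketch (shape and semantics assumed; head grouping and weight sharing are not modeled here):

```python
def swa_mask(seq_len, window):
    """Causal sliding-window mask: mask[i][j] is True where query i
    may attend to key j, i.e. j lies in (i - window, i]."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]
```

With `window: 512`, each token sees at most its 511 predecessors plus itself; the `kv_heads: 4` setting with `heads: 8` suggests grouped-query attention, though the record does not spell that out.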
BigramHash
Uses hashed bigram embeddings as an auxiliary token representation.
parameters: {"dimensions":[3072,112]}
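Hashed n-gram embeddings look up a row by hashing the token pair rather than storing a full vocab-squared table. A hypothetical bucket function (the mixing constants and bucket count here are placeholders, not the record's actual hash):

```python
def bigram_bucket(prev_tok, tok, n_buckets=3072):
    """Map a (previous, current) token pair to an embedding-table row.
    Uses illustrative 32-bit avalanche mixing; collisions are expected
    and absorbed by training."""
    h = (prev_tok * 0x9E3779B1 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    return h % n_buckets
```

The resulting embedding would then be added to (or concatenated with) the ordinary token embedding as an auxiliary representation.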
TrigramHash
Uses trigram hash embeddings as an auxiliary token representation.
parameters: null
SmearGate
Applies a smear gate on token embeddings.
parameters: null
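One common reading of a "smear" gate is a learned, token-dependent blend of each embedding with the previous token's embedding; the record gives no parameters, so the form below is purely an assumption:

```python
import math

def smear_gate(x_prev, x, w, b=0.0):
    """Hypothetical smear gate: add a sigmoid-gated fraction of the
    previous token's embedding x_prev into the current embedding x.
    (w, b) stand in for a learned gate projection."""
    g = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return [xi + g * pi for xi, pi in zip(x, x_prev)]
```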
weight tying
Shares weights across both SWA layers.
parameters: null
Quantization
GPTQ
bits: 6
scope: all
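At 6 bits, symmetric quantization gives integer levels in [-31, 31]. The round-trip below shows only the quantization grid; real GPTQ chooses the roundings using second-order (Hessian) information rather than naive nearest-rounding:

```python
def quantize_rows_int6(W):
    """Per-row symmetric 6-bit fake quantization (grid illustration only;
    GPTQ's error-compensating rounding is not reproduced here)."""
    qmax = 2 ** (6 - 1) - 1  # 31 levels on each side of zero
    out = []
    for row in W:
        scale = max(abs(x) for x in row) / qmax or 1.0  # avoid zero scale
        out.append([round(x / scale) * scale for x in row])
    return out
```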
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
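The EMA update itself is a one-liner per parameter; with `decay: 0.997`, the averaged weights track roughly the last few hundred steps of training:

```python
def ema_update(avg, new, decay=0.997):
    """Exponential moving average of parameters:
    avg <- decay * avg + (1 - decay) * new."""
    return [decay * a + (1.0 - decay) * n for a, n in zip(avg, new)]
```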
Evaluation
sliding window eval
parameters: {"window":512,"stride":64}
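One plausible reading of these parameters (the record does not specify the exact scheme): windows of 512 tokens advance by a stride of 64, and each window scores only the tokens not yet scored, so every token is predicted once with long left context:

```python
def eval_windows(n_tokens, window=512, stride=64):
    """Enumerate (ctx_start, ctx_end, score_start, score_end) spans for
    strided sliding-window evaluation. Each token is scored exactly once,
    in the earliest window that reaches it."""
    spans, start, scored_upto = [], 0, 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto, end))
        scored_upto = end
        start += stride
    return spans
```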
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
cosine decay
parameters: {"warmup_steps":100}
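A standard linear-warmup-then-cosine schedule matching the `warmup_steps: 100` parameter (base and minimum learning rates are placeholders; the record does not state them):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=100, min_lr=0.0):
    """Linear warmup for warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```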
Regularization
logit softcap
parameters: {"value":30}
Other
other
Cold-cache 3-seed training on fresh pods with Triton JIT overhead explicitly accounted for.
parameters: {"seeds":3}
Novel Contributions
- First non-transformer architecture in the 10-minute record track
- Hybrid Gated DeltaNet + Sliding Window Attention backbone
- 3-seed cold-cache mean record result
- Weight-shared SWA layers combined with recurrent delta-rule memory
- Full-Hessian int6 GPTQ quantization with zstd compression