PR #1545
Record (closed): GDN-Hybrid (Gated DeltaNet + SWA), val_bpb 1.028308 (3-seed cold-cache mean)
by Abhishek8108
val_bpb: 1.0283
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.48–14.70 MB
Training Techniques
Architecture
GatedDeltaNet
Replaces the transformer backbone with Gated DeltaNet recurrent layers using delta-rule memory updates.
parameters: {"layers":10,"head_dim":64,"use_short_conv":true}
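The delta-rule memory update can be sketched as below; the function names, scalar per-step gates, and single-head shapes are illustrative assumptions, and the real layers add details (short convolutions, per-head gating networks) not shown here:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule memory (hedged sketch).

    S: (d_v, d_k) memory matrix; k: (d_k,) key; v: (d_v,) value;
    alpha: decay gate in (0, 1]; beta: write strength.
    Implements S_t = alpha*S - beta*(alpha*S k - v) k^T.
    """
    S = alpha * S                       # gated decay of the old memory
    err = S @ k - v                     # delta rule: prediction error for this key
    return S - beta * np.outer(err, k)  # correct the memory toward v

def gated_delta_read(S, q):
    return S @ q                        # readout is linear in the query
```

With alpha = beta = 1 and a unit-norm key, a single write stores v exactly at that key, which is the defining property of the delta rule versus plain additive (linear-attention) updates.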
SWA
Adds sliding window attention layers for local context modeling.
parameters: {"window":512,"heads":8,"kv_heads":4,"shared_weights":true}
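A single-head sketch of the sliding-window attention pattern; the grouped-KV layout (8 query heads over 4 KV heads) and the shared weights are not shown, and the function name is an assumption:

```python
import numpy as np

def sliding_window_attention(q, k, v, window=512):
    """Causal attention where position i attends only to j in (i-window, i].

    q, k, v: (T, d) single-head tensors.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)            # causal + local band
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```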
BigramHash
Uses bigram hash embeddings as an auxiliary input feature.
parameters: {"shape":"3072x112"}
TrigramHash
Uses trigram hash embeddings as an auxiliary input feature.
parameters: null
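The n-gram hash features can be sketched as follows, using the listed 3072x112 bigram table shape; the multiplicative hash function and initialization are assumptions (the record does not specify them), and the trigram variant is the same idea over three-token windows:

```python
import numpy as np

NUM_BUCKETS, DIM = 3072, 112            # bigram table shape from the record
rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, (NUM_BUCKETS, DIM))

def ngram_bucket(ids, num_buckets=NUM_BUCKETS):
    """Hash an n-gram of token ids into a bucket (hash choice is an assumption)."""
    h = 0
    for t in ids:
        h = (h * 1000003 + int(t)) % (1 << 31)
    return h % num_buckets

def bigram_features(tokens):
    # One hashed embedding per position, keyed by (previous token, token).
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_table[ngram_bucket(tokens[i - 1:i + 1])]
    return out
```

Hashing lets a fixed-size table cover all possible bigrams at the cost of occasional bucket collisions.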
SmearGate
Applies SmearGate on token embeddings.
parameters: null
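The record leaves SmearGate undocumented. One plausible interpretation (an assumption, not the author's confirmed design) is a learned sigmoid gate that smears each previous token's embedding into the current one:

```python
import numpy as np

def smear(x, gate_logit=0.0):
    """Hedged guess at a smear gate: blend each token embedding with the
    previous one through a sigmoid gate (interpretation is an assumption).

    x: (T, d) token embeddings; gate_logit: learned scalar logit.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))   # sigmoid gate in (0, 1)
    out = x.copy()
    out[1:] = x[1:] + g * x[:-1]            # smear the previous embedding forward
    return out
```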
weight tying
Shares weights across the two SWA layers.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
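Logit softcapping with value 30 bounds the logits smoothly while staying near-linear for small values; a minimal sketch (function name is illustrative):

```python
import numpy as np

def logit_softcap(logits, cap=30.0):
    """Smoothly bound logits to [-cap, cap]; approximately the identity
    for |logits| << cap, saturating via tanh for large magnitudes."""
    return cap * np.tanh(logits / cap)
```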
Quantization
GPTQ
bits: 6
scope: all
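GPTQ chooses roundings with Hessian information so that layer outputs, not individual weights, are preserved; that part is not reproduced here. The sketch below shows only the symmetric 6-bit grid (integer levels -32..31) being rounded onto, using plain round-to-nearest as a simplification:

```python
import numpy as np

def quantize_int6_rtn(W):
    """Round-to-nearest onto a per-row symmetric 6-bit grid.

    Simplification: real GPTQ picks roundings via second-order (Hessian)
    error compensation rather than independent round-to-nearest.
    """
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    q = np.clip(np.round(W / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```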
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
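The EMA of the weights with decay 0.997 amounts to the standard update below (the dict-of-arrays layout is illustrative):

```python
def ema_update(ema, params, decay=0.997):
    """In-place exponential moving average of parameters:
    ema <- decay * ema + (1 - decay) * params."""
    for name, p in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * p
    return ema
```

Evaluating the EMA weights rather than the raw final weights typically smooths out late-training noise.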
Evaluation
sliding window eval
parameters: {"window":512,"stride":64}
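With window 512 and stride 64, each evaluation step scores 64 new tokens while conditioning on up to 512 tokens of context. A sketch of the span bookkeeping (the helper is an assumption, not the record's code):

```python
def eval_spans(n_tokens, window=512, stride=64):
    """Spans for sliding-window evaluation. Each tuple is
    (ctx_start, end, score_start): the model sees tokens [ctx_start, end),
    but loss is computed only on [score_start, end), so every token is
    scored exactly once with at most `window` tokens of context."""
    spans, score_start = [], 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, score_start))
        score_start = end
    return spans
```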
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
cosine decay
parameters: {"warmup_steps":100}
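A minimal sketch of cosine decay with 100 linear warmup steps; the decay-to-zero endpoint and the function name are assumptions:

```python
import math

def lr_at(step, base_lr, max_steps, warmup_steps=100):
    """Linear warmup for `warmup_steps`, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```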
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"adamw_for":"embeddings/scalars"}
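Muon's core step replaces each 2D gradient with an approximately orthogonalized version via a quintic Newton-Schulz iteration (5 steps here, with AdamW handling embeddings and scalars). A sketch of the orthogonalization, using the coefficients from the public Muon implementation:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds spectral norm by 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:                        # keep X @ X.T the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X
```

The iteration only needs matrix multiplies, which is why it is practical inside an optimizer step.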
Novel Contributions
- First non-transformer architecture in the 10-minute record track
- Hybrid model combining Gated DeltaNet recurrence with sliding window attention
- Weight-shared SWA layers
- Full-Hessian int6 GPTQ quantization with zstd compression
- Cold-cache 3-seed mean record of 1.028308 val_bpb