PR #1562

closed

Record: GDN-Hybrid (Gated DeltaNet + Sliding Window Attention) - quantized_bpb 1.02046

by joshkmartinez
val_bpb
1.0205
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.31–15.83 MB

Training Techniques

Architecture
Gated DeltaNet
Hybrid backbone using a Gated DeltaNet-based architecture with a side path of sliding-window attention.
parameters: {"layers":5}
sliding window attention
Sliding-window attention side path in the GDN-Hybrid backbone.
parameters: null
weight tying
Weight sharing across the SWA branch of the hybrid architecture.
parameters: null
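The sliding-window side path restricts each position to a fixed-size causal window. A minimal sketch of such a mask (the window size and helper name here are illustrative, not taken from the PR):

```python
# Causal sliding-window attention mask sketch (assumption: not the PR's
# actual implementation): position i may attend to positions j with
# i - window < j <= i.
def swa_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask; True = may attend."""
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = swa_mask(5, 2)
# Row 3 attends only to positions 2 and 3 (window of 2, causal).
```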
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997}
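The EMA update rule itself is standard; a minimal sketch using the record's decay of 0.997 (parameter names are illustrative):

```python
# EMA weight-averaging sketch with decay 0.997, as listed above.
# ema <- decay * ema + (1 - decay) * param, applied after each step.
def ema_update(ema_params, params, decay=0.997):
    """Update the EMA copy of the weights in place."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p

ema = {"w": 1.0}
ema_update(ema, {"w": 0.0})
# ema["w"] is now 0.997
```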
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all
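GPTQ proper does Hessian-aware, column-by-column error compensation; the sketch below only illustrates the signed int6 value range the final artifact is packed to, using naive round-to-nearest with a single per-tensor scale (an assumption, not the PR's method):

```python
# Naive round-to-nearest 6-bit symmetric quantization sketch.
# Signed 6-bit ints span [-32, 31]; one scale per tensor.
def quant6(xs):
    """Quantize a list of floats to int6 codes plus a scale."""
    max_abs = max(abs(x) for x in xs) or 1.0
    scale = max_abs / 31.0
    q = [max(-32, min(31, round(x / scale))) for x in xs]
    return q, scale

def dequant6(q, scale):
    """Map int6 codes back to approximate float values."""
    return [v * scale for v in q]
```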
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
SP1024 tokenizer with token reallocation for the model backbone.
parameters: {"vocab_size":1024}

Novel Contributions

  • GDN-Hybrid backbone combining Gated DeltaNet with sliding-window attention
  • SP1024-tokenized architecture
  • Late QAT with GPTQ int6 packaging
  • MuonEq-R + AdamW training mix
  • Shared-weight SWA branch in the hybrid architecture
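The gated delta-rule recurrence at the core of the Gated DeltaNet layers can be sketched as follows. This is the published recurrence S_t = a_t * S_{t-1} * (I - b_t * k k^T) + b_t * v k^T with output o_t = S_t q, written out in pure Python; all dimensions and names are illustrative, not taken from the PR:

```python
# One step of the gated delta rule (sketch, small-matrix pure Python).
# S is a d_v x d_k state matrix; alpha is the decay gate, beta the
# write strength. With alpha=1, beta=1 and unit-norm k, the step
# exactly overwrites the value stored under key k.
def gated_delta_step(S, k, v, q, alpha, beta):
    """Return (new state, output) for one token."""
    d_v, d_k = len(S), len(k)
    # Current prediction of v from the state: S k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_t = alpha * S + beta * (v - alpha * S k) k^T
    S_new = [
        [alpha * S[i][j] + beta * (v[i] - alpha * pred[i]) * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # o_t = S_t q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out
```

Re-writing the same key/value pair leaves the state unchanged, which is the delta rule's error-correcting property.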