PR #1562

closed

Record: GDN-Hybrid (Gated DeltaNet + Sliding Window Attention) - quantized_bpb 1.02046

by joshkmartinez
val_bpb
1.0205
Architecture
Hybrid
Optimizer
Muon
Artifact Size
15.31–15.83 MB

Training Techniques

Architecture
Gated DeltaNet
Hybrid backbone using a Gated DeltaNet-based architecture with a side path of sliding-window attention.
parameters: {"layers":5}
sliding window attention
Sliding-window attention side path in the GDN-Hybrid backbone.
parameters: null
weight tying
Weight sharing across the SWA branch of the hybrid architecture.
parameters: null
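The sliding-window side path restricts each position to a fixed-size causal window. A minimal sketch of such a mask (the window size and helper name here are illustrative, not taken from the PR):

```python
# Causal sliding-window attention mask sketch (assumption: not the PR's
# actual implementation): position i may attend to positions j with
# i - window < j <= i.
def swa_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask; True = may attend."""
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = swa_mask(5, 2)
# Row 3 attends only to positions 2 and 3 (window of 2, causal).
```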
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997}
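The EMA update rule itself is standard; a minimal sketch using the record's decay of 0.997 (parameter names are illustrative):

```python
# EMA weight-averaging sketch with decay 0.997, as listed above.
# ema <- decay * ema + (1 - decay) * param, applied after each step.
def ema_update(ema_params, params, decay=0.997):
    """Update the EMA copy of the weights in place."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p

ema = {"w": 1.0}
ema_update(ema, {"w": 0.0})
# ema["w"] is now 0.997
```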
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all
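GPTQ proper does Hessian-aware, column-by-column error compensation; the sketch below only illustrates the signed int6 value range the final artifact is packed to, using naive round-to-nearest with a single per-tensor scale (an assumption, not the PR's method):

```python
# Naive round-to-nearest 6-bit symmetric quantization sketch.
# Signed 6-bit ints span [-32, 31]; one scale per tensor.
def quant6(xs):
    """Quantize a list of floats to int6 codes plus a scale."""
    max_abs = max(abs(x) for x in xs) or 1.0
    scale = max_abs / 31.0
    q = [max(-32, min(31, round(x / scale))) for x in xs]
    return q, scale

def dequant6(q, scale):
    """Map int6 codes back to approximate float values."""
    return [v * scale for v in q]
```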
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
SP1024 tokenizer with token reallocation for the model backbone.
parameters: {"vocab_size":1024}

Novel Contributions

  • GDN-Hybrid backbone combining Gated DeltaNet with sliding-window attention
  • SP1024-tokenized architecture
  • Late QAT with GPTQ int6 packaging
  • MuonEq-R + AdamW training mix
  • Shared-weight SWA branch in the hybrid architecture
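The gated delta-rule recurrence at the core of the Gated DeltaNet layers can be sketched as follows. This is the published recurrence S_t = a_t * S_{t-1} * (I - b_t * k k^T) + b_t * v k^T with output o_t = S_t q, written out in pure Python; all dimensions and names are illustrative, not taken from the PR:

```python
# One step of the gated delta rule (sketch, small-matrix pure Python).
# S is a d_v x d_k state matrix; alpha is the decay gate, beta the
# write strength. With alpha=1, beta=1 and unit-norm k, the step
# exactly overwrites the value stored under key k.
def gated_delta_step(S, k, v, q, alpha, beta):
    """Return (new state, output) for one token."""
    d_v, d_k = len(S), len(k)
    # Current prediction of v from the state: S k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_t = alpha * S + beta * (v - alpha * S k) k^T
    S_new = [
        [alpha * S[i][j] + beta * (v[i] - alpha * pred[i]) * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # o_t = S_t q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out
```

Re-writing the same key/value pair leaves the state unchanged, which is the delta rule's error-correcting property.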