PR #1564
Record: GDN-Hybrid + Sliding Window Attention (cold-cache, 1.01710 BPB)
by joshkmartinez
val_bpb
1.0171
Architecture
Hybrid
Optimizer
AdamW
Artifact Size
15.52–15.98 MB
Training Techniques
Architecture
Gated DeltaNet hybrid
Gated DeltaNet (GDN) backbone with a sliding-window attention side path.
parameters: {"layers_pattern":"[GDN×5] → SWA → [GDN×5] → SWA_shared","tokenizer":"SP1024"}
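The SWA layers in the pattern above restrict each query to a fixed-size causal window. A minimal sketch of such a mask; the actual window size is not stated in this record, so `window=4` is purely illustrative:

```python
# Causal sliding-window attention mask (illustrative; window size is assumed).

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k:
    causal (k <= q) and within the last `window` positions (q - k < window)."""
    return [
        [k <= q and q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=4)
# Query position 5 attends only to key positions 2..5.
print([k for k in range(6) if mask[5][k]])  # → [2, 3, 4, 5]
```

This keeps per-token attention cost constant in sequence length, which is the usual motivation for mixing SWA layers into a linear-attention (GDN) stack.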
Evaluation
sliding window eval
parameters: {"cold_cache":true}
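Cold-cache evaluation means each chunk is scored from a fresh (empty) recurrent/KV state, with no context carried across chunk boundaries. A hypothetical sketch of the resulting bits-per-byte computation; the per-chunk losses and byte counts below are illustrative numbers, not the record's data:

```python
import math

def bits_per_byte(chunk_nll_nats: list[float], chunk_bytes: list[int]) -> float:
    # BPB = total negative log-likelihood (nats -> bits) / total bytes scored.
    # Each chunk's NLL is assumed to come from a model whose state was reset
    # before the chunk (cold cache).
    total_bits = sum(chunk_nll_nats) / math.log(2)
    return total_bits / sum(chunk_bytes)

print(bits_per_byte([693.1, 700.0], [1000, 1000]))
```

Cold-cache numbers are typically slightly worse than warm-cache ones, since the model gets no free context at chunk starts.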
Weight Averaging
EMA
parameters: {"decay":0.997}
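A minimal sketch of EMA weight averaging with the record's decay of 0.997; the parameter names are illustrative:

```python
DECAY = 0.997  # from the record

def ema_update(ema_weights: dict[str, float], model_weights: dict[str, float],
               decay: float = DECAY) -> None:
    # ema <- decay * ema + (1 - decay) * current weights, applied in place
    # after each optimizer step; the EMA copy is what gets evaluated.
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w

ema = {"w": 1.0}
ema_update(ema, {"w": 2.0})
print(ema["w"])  # ≈ 0.997 * 1.0 + 0.003 * 2.0 = 1.003
```

With decay 0.997 the effective averaging horizon is roughly 1 / (1 - 0.997) ≈ 333 steps.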
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all
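For intuition about the int6 packaging, here is a plain round-to-nearest symmetric quantizer. This is NOT the GPTQ algorithm (which compensates rounding error column-by-column using second-order statistics of the weights); it only shows the int6 grid that the GPTQ output lives on, values in [-32, 31]:

```python
# Illustrative round-to-nearest symmetric int6 quantization (not GPTQ).

def quantize_int6(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 31.0  # map max |w| onto int6 max
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

q, scale = quantize_int6([0.5, -0.25, 0.1, -0.5])
# Round-trip error per weight is bounded by scale / 2.
print(q)
```

Running late QAT before this step lets the weights adapt to the quantization grid during training, which typically narrows the gap to the full-precision model.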
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
SP1024 tokenizer with token reallocation.
parameters: {"vocab_size":1024}
Novel Contributions
- Cold-cache confirmation of the GDN-Hybrid line
- Hybrid GDN backbone with sliding-window attention side path
- SP1024 tokenizer
- Late QAT with GPTQ int6 packaging
- 8×H100 SXM training setup