PR #1564
Record: GDN-Hybrid + Sliding Window Attention (cold-cache, 1.01710 BPB)
by joshkmartinez
val_bpb
1.0171
Architecture
Hybrid
Optimizer
AdamW
Artifact Size
15.52–15.98 MB
Training Techniques
Architecture
Gated DeltaNet hybrid
Gated DeltaNet (GDN) backbone with a sliding-window attention side path.
parameters: {"layers_pattern":"[GDN×5] → SWA → [GDN×5] → SWA_shared","tokenizer":"SP1024"}
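The SWA layers in the pattern above restrict each query to a fixed-size causal window. A minimal sketch of such a mask; the actual window size is not stated in this record, so `window=4` is purely illustrative:

```python
# Causal sliding-window attention mask (illustrative; window size is assumed).

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k:
    causal (k <= q) and within the last `window` positions (q - k < window)."""
    return [
        [k <= q and q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=4)
# Query position 5 attends only to key positions 2..5.
print([k for k in range(6) if mask[5][k]])  # → [2, 3, 4, 5]
```

This keeps per-token attention cost constant in sequence length, which is the usual motivation for mixing SWA layers into a linear-attention (GDN) stack.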
Evaluation
sliding window eval
parameters: {"cold_cache":true}
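Cold-cache evaluation means each chunk is scored from a fresh (empty) recurrent/KV state, with no context carried across chunk boundaries. A hypothetical sketch of the resulting bits-per-byte computation; the per-chunk losses and byte counts below are illustrative numbers, not the record's data:

```python
import math

def bits_per_byte(chunk_nll_nats: list[float], chunk_bytes: list[int]) -> float:
    # BPB = total negative log-likelihood (nats -> bits) / total bytes scored.
    # Each chunk's NLL is assumed to come from a model whose state was reset
    # before the chunk (cold cache).
    total_bits = sum(chunk_nll_nats) / math.log(2)
    return total_bits / sum(chunk_bytes)

print(bits_per_byte([693.1, 700.0], [1000, 1000]))
```

Cold-cache numbers are typically slightly worse than warm-cache ones, since the model gets no free context at chunk starts.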
Weight Averaging
EMA
parameters: {"decay":0.997}
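A minimal sketch of EMA weight averaging with the record's decay of 0.997; the parameter names are illustrative:

```python
DECAY = 0.997  # from the record

def ema_update(ema_weights: dict[str, float], model_weights: dict[str, float],
               decay: float = DECAY) -> None:
    # ema <- decay * ema + (1 - decay) * current weights, applied in place
    # after each optimizer step; the EMA copy is what gets evaluated.
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w

ema = {"w": 1.0}
ema_update(ema, {"w": 2.0})
print(ema["w"])  # ≈ 0.997 * 1.0 + 0.003 * 2.0 = 1.003
```

With decay 0.997 the effective averaging horizon is roughly 1 / (1 - 0.997) ≈ 333 steps.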
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all
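For intuition about the int6 packaging, here is a plain round-to-nearest symmetric quantizer. This is NOT the GPTQ algorithm (which compensates rounding error column-by-column using second-order statistics of the weights); it only shows the int6 grid that the GPTQ output lives on, values in [-32, 31]:

```python
# Illustrative round-to-nearest symmetric int6 quantization (not GPTQ).

def quantize_int6(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 31.0  # map max |w| onto int6 max
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

q, scale = quantize_int6([0.5, -0.25, 0.1, -0.5])
# Round-trip error per weight is bounded by scale / 2.
print(q)
```

Running late QAT before this step lets the weights adapt to the quantization grid during training, which typically narrows the gap to the full-precision model.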
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
SP1024 tokenizer with token reallocation.
parameters: {"vocab_size":1024}
Novel Contributions
- Cold-cache confirmation of the GDN-Hybrid line
- Hybrid GDN backbone with sliding-window attention side path
- SP1024 tokenizer
- Late QAT with GPTQ int6 packaging
- 8×H100 SXM training setup