PR #1576

open

Record: GDN-Hybrid + Sliding Window Attention + compressed-code warmdown1000 - val_bpb 1.01671 (3-seed mean)

by joshkmartinez
val_bpb
1.0167
Architecture
Hybrid
Optimizer
Muon + AdamW
Artifact Size
15.71–15.90 MB

Training Techniques

Architecture
GDN-Hybrid
Hybrid backbone using a GDN-based stack with sliding-window attention blocks.
parameters: {"stack":"[GDN×5] → SWA → [GDN×5] → SWA_shared"}
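The "stack" string above can be read as: five GDN blocks, one sliding-window attention (SWA) block, five more GDN blocks, then a final SWA block that reuses the first SWA block's weights (one plausible reading of "SWA_shared"). A minimal sketch of that layer ordering, with sharing modeled as object identity:

```python
def build_stack():
    """Hypothetical layer-order sketch for [GDN×5] → SWA → [GDN×5] → SWA_shared.
    Layer names and the weight-sharing interpretation are assumptions,
    not confirmed details of this submission."""
    swa = {"type": "SWA", "id": 0}                        # SWA weights defined once
    layers = [{"type": "GDN", "id": i} for i in range(5)]  # first GDN run
    layers.append(swa)                                     # mid-stack SWA
    layers += [{"type": "GDN", "id": 5 + i} for i in range(5)]  # second GDN run
    layers.append(swa)                                     # SWA_shared: same object
    return layers

stack = build_stack()
```

Sharing the second SWA block's weights with the first would keep the parameter count (and thus the compressed artifact size) down while still placing attention at two depths.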
sliding window eval
A sliding-window attention side path is present in the model.
parameters: null
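For reference, a causal sliding-window attention mask restricts each position to the most recent `window` tokens (itself included). A minimal NumPy sketch of that mask; the window size is illustrative, as the record does not state one:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions j
    with i - window < j <= i (causal, limited lookback)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)  # 6 tokens, window of 3
```

Interleaving such local-attention blocks with linear-time GDN blocks is what makes the stack a hybrid: the SWA blocks supply precise short-range token mixing at a cost that grows with `window` rather than with sequence length.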
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997}
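EMA weight averaging maintains a shadow copy of the weights updated as `avg ← decay·avg + (1 − decay)·params` after each step, and the averaged weights are used for evaluation. A minimal sketch with the record's decay of 0.997 (dict-of-floats stands in for real tensors):

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over a dict of weights: avg <- decay*avg + (1-decay)*params.
    decay=0.997 is taken from the record; the dict representation is illustrative."""
    for k in params:
        avg[k] = decay * avg[k] + (1.0 - decay) * params[k]
    return avg

avg = {"w": 0.0}
avg = ema_update(avg, {"w": 1.0})  # avg["w"] moves 0.3% of the way toward 1.0
```

With decay 0.997 the average has an effective horizon of roughly 1/(1 − 0.997) ≈ 333 steps, smoothing out late-training noise in the evaluated weights.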
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
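The record gives only `warmdown_steps: 1000`; a common shape for such a schedule is a constant learning rate followed by a linear decay to zero over the final warmdown steps. A sketch under that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=1000):
    """Constant LR, then linear decay to 0 over the last warmdown_steps.
    The constant-then-linear shape is an assumption; only
    warmdown_steps=1000 appears in the record."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

lr = warmdown_lr(step=4500, total_steps=5000, base_lr=0.01)  # halfway through warmdown
```

The `total_steps=5000` and `base_lr=0.01` values in the usage line are placeholders, not figures from the submission.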
Quantization
GPTQ
bits: 6
scope: all
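GPTQ quantizes weights column by column, using second-order (Hessian-based) information to compensate each column's rounding error in the remaining columns. That machinery is beyond a snippet, but the 6-bit grid it targets is simple to show; a round-to-nearest symmetric int6 sketch (not GPTQ itself):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric round-to-nearest quantization onto the int6 grid (-31..31).
    GPTQ would additionally reorder/compensate rounding error using Hessian
    information; this sketch shows only the 6-bit grid and scale."""
    qmax = 2 ** (6 - 1) - 1                 # 31: largest signed 6-bit magnitude
    scale = np.abs(w).max() / qmax          # one scale per tensor (per-channel in practice)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w = np.array([0.31, -0.05, 0.002, -0.31])
q, s = quantize_int6(w)
w_hat = q * s                               # dequantized weights for inference
```

At 6 bits the worst-case rounding error is half a quantization step (`s / 2`), which is why int6 tends to preserve val_bpb far better than 4-bit schemes while still shrinking the artifact substantially.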
Compression
zstd
level: 22
Other
other
Compressed-code packaging for train_gpt.py, architectures.py, and configs.py to recover artifact headroom.
parameters: null

Novel Contributions

  • GDN-Hybrid backbone with sliding-window attention
  • Warmdown1000 learning-rate schedule
  • Fixed-predictor Track A submission with no test-time training (TTT) or eval-time adaptation
  • GPTQ int6 quantization with zstd-22 packaging
  • Compressed-code packaging to fit under the 16 MB artifact cap