PR #1575
Record: GDN-Hybrid + Sliding Window Attention + compressed-code warmdown1000 (cold-cache, 1.01671 BPB)
by joshkmartinez
val_bpb: 1.0167
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,903,365 bytes
Training Techniques
Architecture
sliding window eval
GDN-hybrid backbone with a sliding-window attention side path (SWA_shared structure).
parameters: null
GDN
GDN-hybrid backbone with repeated GDN blocks and sliding-window attention.
parameters: {"layers":5}
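The sliding-window attention path restricts each query to a fixed-size causal window of recent keys. A minimal sketch of that masking rule (function name and window size are illustrative, not taken from the record's code):

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    # query i may attend key j only when j <= i (causal) and i - j < window
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# each row has at most `window` True entries, always including the diagonal
```

This keeps per-token attention cost O(window) instead of O(seq_len), which is the usual motivation for pairing SWA layers with a linear-time backbone like GDN.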
Quantization
GPTQ
bits: 6
scope: all
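GPTQ quantizes weights column-by-column with a Hessian-based correction of accumulated rounding error; that machinery is beyond a short sketch, but the signed 6-bit grid it targets can be illustrated with plain round-to-nearest (a simplified stand-in, not GPTQ itself; per-row scaling is an assumption):

```python
import numpy as np

def quantize_rtn_6bit(w: np.ndarray, bits: int = 6):
    # symmetric per-row scale onto a signed 6-bit grid [-32, 31];
    # GPTQ additionally compensates rounding error with second-order info
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_rtn_6bit(w)
w_hat = q * s                                       # dequantized weights
```

At 6 bits the worst-case per-element error of this stand-in is half a quantization step (scale / 2) per row.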
Weight Averaging
EMA
parameters: {"decay":0.997}
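EMA weight averaging keeps a slow-moving copy of the parameters that is updated after every optimizer step; with decay 0.997 the average effectively tracks the last few hundred steps. A minimal sketch, using a list of floats as a stand-in for real parameter tensors:

```python
def ema_update(avg: list[float], new: list[float], decay: float = 0.997) -> list[float]:
    # avg <- decay * avg + (1 - decay) * new, elementwise
    return [decay * a + (1 - decay) * n for a, n in zip(avg, new)]

avg = [0.0]
for _ in range(10):
    avg = ema_update(avg, [1.0])        # after k steps: 1 - 0.997**k
```

Evaluation then uses the averaged weights rather than the raw final-step weights.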
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
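The warmdown schedule holds the base learning rate constant and then decays it linearly to zero over the final warmdown_steps (1000 here). A minimal sketch (function name illustrative):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 1000) -> float:
    # constant LR until the last warmdown_steps, then linear decay to 0
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * remaining / warmdown_steps
```

For a 5000-step run, the LR stays at base_lr through step 4000 and reaches zero at step 5000.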
Compression
zstd
level: 22
Other
other
Compressed-code packaging for train_gpt.py, architectures.py, and configs.py to recover artifact-size headroom.
parameters: null
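The packaging step archives the three training-code files and compresses the archive to claw back artifact-size headroom. The record uses zstd at level 22; this sketch substitutes stdlib lzma so it runs without the third-party zstandard bindings, and the 16 MiB cap is an assumption based on the sub-16 MB legality rule:

```python
import io
import lzma
import tarfile

SIZE_CAP = 16 * 1024 * 1024  # assumed 16 MiB artifact cap

def pack_sources(named_blobs: dict[str, bytes]) -> bytes:
    """Tar the source files, then compress the archive.

    Stand-in: the record compresses with zstd level 22; lzma is used
    here only so the sketch is stdlib-runnable.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in named_blobs.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    blob = lzma.compress(buf.getvalue(), preset=9)
    assert len(blob) <= SIZE_CAP, "artifact exceeds size cap"
    return blob

artifact = pack_sources({
    "train_gpt.py": b"print('train')\n",
    "architectures.py": b"# GDN blocks\n",
    "configs.py": b"warmdown_steps = 1000\n",
})
```

The same size check applied to the real artifact (15,903,365 bytes) confirms it fits under the cap.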
Novel Contributions
- Repaired compressed-code warmdown1000 bundle for the GDN-Hybrid family
- Cold-cache 3-seed confirmation run with authoritative pulled TensorPool artifacts
- Improved mean quantized BPB over PR #1564 while reducing max artifact size
- SAFE_SUBMISSION fixed-predictor Track-A legality lane with all artifacts under 16 MB