PR #1576

open

Record: GDN-Hybrid + Sliding Window Attention + compressed-code warmdown1000 - val_bpb 1.01671 (3-seed mean)

by joshkmartinez
val_bpb
1.0167
Architecture
Hybrid
Optimizer
Muon + AdamW
Artifact Size
15.71–15.90 MB

Training Techniques

Architecture
GDN-Hybrid
Hybrid backbone using a GDN-based stack with sliding-window attention blocks.
parameters: {"stack":"[GDN×5] → SWA → [GDN×5] → SWA_shared"}
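The "stack" string above can be read as: five GDN blocks, one sliding-window attention (SWA) block, five more GDN blocks, then a final SWA block that reuses the first SWA block's weights (one plausible reading of "SWA_shared"). A minimal sketch of that layer ordering, with sharing modeled as object identity:

```python
def build_stack():
    """Hypothetical layer-order sketch for [GDN×5] → SWA → [GDN×5] → SWA_shared.
    Layer names and the weight-sharing interpretation are assumptions,
    not confirmed details of this submission."""
    swa = {"type": "SWA", "id": 0}                        # SWA weights defined once
    layers = [{"type": "GDN", "id": i} for i in range(5)]  # first GDN run
    layers.append(swa)                                     # mid-stack SWA
    layers += [{"type": "GDN", "id": 5 + i} for i in range(5)]  # second GDN run
    layers.append(swa)                                     # SWA_shared: same object
    return layers

stack = build_stack()
```

Sharing the second SWA block's weights with the first would keep the parameter count (and thus the compressed artifact size) down while still placing attention at two depths.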
sliding window eval
A sliding-window attention side path is present in the model.
parameters: null
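For reference, a causal sliding-window attention mask restricts each position to the most recent `window` tokens (itself included). A minimal NumPy sketch of that mask; the window size is illustrative, as the record does not state one:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions j
    with i - window < j <= i (causal, limited lookback)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)  # 6 tokens, window of 3
```

Interleaving such local-attention blocks with linear-time GDN blocks is what makes the stack a hybrid: the SWA blocks supply precise short-range token mixing at a cost that grows with `window` rather than with sequence length.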
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
Weight Averaging
EMA
parameters: {"decay":0.997}
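EMA weight averaging maintains a shadow copy of the weights updated as `avg ← decay·avg + (1 − decay)·params` after each step, and the averaged weights are used for evaluation. A minimal sketch with the record's decay of 0.997 (dict-of-floats stands in for real tensors):

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over a dict of weights: avg <- decay*avg + (1-decay)*params.
    decay=0.997 is taken from the record; the dict representation is illustrative."""
    for k in params:
        avg[k] = decay * avg[k] + (1.0 - decay) * params[k]
    return avg

avg = {"w": 0.0}
avg = ema_update(avg, {"w": 1.0})  # avg["w"] moves 0.3% of the way toward 1.0
```

With decay 0.997 the average has an effective horizon of roughly 1/(1 − 0.997) ≈ 333 steps, smoothing out late-training noise in the evaluated weights.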
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
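The record gives only `warmdown_steps: 1000`; a common shape for such a schedule is a constant learning rate followed by a linear decay to zero over the final warmdown steps. A sketch under that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=1000):
    """Constant LR, then linear decay to 0 over the last warmdown_steps.
    The constant-then-linear shape is an assumption; only
    warmdown_steps=1000 appears in the record."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

lr = warmdown_lr(step=4500, total_steps=5000, base_lr=0.01)  # halfway through warmdown
```

The `total_steps=5000` and `base_lr=0.01` values in the usage line are placeholders, not figures from the submission.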
Quantization
GPTQ
bits: 6
scope: all
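GPTQ quantizes weights column by column, using second-order (Hessian-based) information to compensate each column's rounding error in the remaining columns. That machinery is beyond a snippet, but the 6-bit grid it targets is simple to show; a round-to-nearest symmetric int6 sketch (not GPTQ itself):

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric round-to-nearest quantization onto the int6 grid (-31..31).
    GPTQ would additionally reorder/compensate rounding error using Hessian
    information; this sketch shows only the 6-bit grid and scale."""
    qmax = 2 ** (6 - 1) - 1                 # 31: largest signed 6-bit magnitude
    scale = np.abs(w).max() / qmax          # one scale per tensor (per-channel in practice)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w = np.array([0.31, -0.05, 0.002, -0.31])
q, s = quantize_int6(w)
w_hat = q * s                               # dequantized weights for inference
```

At 6 bits the worst-case rounding error is half a quantization step (`s / 2`), which is why int6 tends to preserve val_bpb far better than 4-bit schemes while still shrinking the artifact substantially.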
Compression
zstd
level: 22
Other
other
Compressed-code packaging for train_gpt.py, architectures.py, and configs.py to recover artifact headroom.
parameters: null

Novel Contributions

  • GDN-Hybrid backbone with sliding-window attention
  • Warmdown1000 learning-rate schedule
  • Fixed-predictor Track A submission with no test-time training (TTT) or eval-time adaptation
  • GPTQ int6 quantization with zstd-22 packaging
  • Compressed-code packaging to fit under the 16 MB artifact cap