PR #1576
openRecord: GDN-Hybrid + Sliding Window Attention + compressed-code warmdown1000 - val_bpb 1.01671 (3-seed mean)
by joshkmartinez
val_bpb
1.0167
Architecture
Hybrid
Optimizer
Muon + AdamW
Artifact Size
15.71–15.90 MB
Training Techniques
Architecture
GDN-Hybrid
Hybrid backbone using a GDN-based stack with sliding-window attention blocks.
parameters: {"stack":"[GDN×5] → SWA → [GDN×5] → SWA_shared"}
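The stack string above can be read as a 12-block layout. A hypothetical sketch (block classes are stand-ins; the PR only gives the ordering and the fact that the final SWA block is shared):

```python
# Layout sketch for "[GDN×5] → SWA → [GDN×5] → SWA_shared":
# ten GDN blocks with a sliding-window attention (SWA) block after the
# first five, and a final SWA block reusing the first SWA's weights.
class Block:
    def __init__(self, kind: str):
        self.kind = kind

def build_stack():
    swa = Block("SWA")
    stack = [Block("GDN") for _ in range(5)] + [swa]
    stack += [Block("GDN") for _ in range(5)] + [swa]  # same object: shared weights
    return stack
```

Sharing the second SWA block's weights with the first keeps the parameter count (and hence the artifact size) down.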
sliding window eval
Sliding-window attention side path present in the model.
parameters: null
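For reference, a minimal sliding-window causal mask looks like the following; the window size is an assumption, since the PR does not state it:

```python
import numpy as np

def swa_mask(seq_len: int, window: int) -> np.ndarray:
    # Position i may attend to positions max(0, i - window + 1) .. i.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

m = swa_mask(6, 3)
```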
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
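The "MuonEq-R" variant is not specified in the PR; the core of any Muon-style optimizer, though, is a Newton-Schulz iteration that approximately orthogonalizes the momentum update. A generic sketch, with coefficients from the public Muon implementation:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration pushing singular values toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm bounds singular values by 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

rng = np.random.default_rng(0)
U = newton_schulz(rng.standard_normal((8, 8)))
```

In the usual mixed setup, Muon handles the 2D weight matrices while AdamW handles embeddings and scalar parameters, which may be what the "MuonEq-R + AdamW" mix refers to.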
Weight Averaging
EMA
parameters: {"decay":0.997}
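A minimal EMA weight-averaging update with the reported decay of 0.997 (the PR does not show its actual integration into the training loop):

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997) -> dict:
    # Exponential moving average: shadow weights drift toward live weights.
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}

ema = {"w": 0.0}
for _ in range(1000):
    ema = ema_update(ema, {"w": 1.0})
```

Evaluating with the EMA shadow weights rather than the raw weights typically smooths out late-training noise.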
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
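A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps; the sketch below assumes a linear decay shape and placeholder totals, with only warmdown_steps=1000 taken from the PR:

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 1000) -> float:
    # Constant LR until the warmdown window, then linear decay to zero.
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```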
Quantization
GPTQ
bits: 6
scope: all
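GPTQ proper does Hessian-aware, error-compensated rounding; the sketch below shows only the symmetric 6-bit grid such a quantizer rounds onto (levels -31..31 here, an assumption), to illustrate the storage format:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-tensor quantization onto 63 levels (-31..31).
    scale = float(np.abs(w).max()) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 7, dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

At 6 bits per weight the quantized tensor is roughly a quarter the size of bf16 storage before entropy coding, which is what makes the 16 MB budget reachable.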
Compression
zstd
level: 22
Other
other
Compressed-code packaging for train_gpt.py, architectures.py, and configs.py to recover artifact headroom.
parameters: null
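The packaging idea can be sketched as follows: compress the source text, embed it as a string, and decompress-and-exec it at load time. The actual submission uses zstd at level 22 (see above); zlib stands in here only because it ships with the standard library, and the function body is a made-up placeholder.

```python
import base64
import zlib

# Compress a source file's text into an embeddable base64 blob.
source = "def val_bpb():\n    return 1.0167\n"
blob = base64.b64encode(zlib.compress(source.encode(), level=9)).decode()

# At load time, decompress and exec to recover the module's namespace.
namespace = {}
exec(zlib.decompress(base64.b64decode(blob)).decode(), namespace)
```

Since the artifact cap counts code as well as weights, shrinking train_gpt.py, architectures.py, and configs.py this way frees headroom for the quantized weights.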
Novel Contributions
- GDN-Hybrid backbone with sliding-window attention
- Warmdown1000 learning-rate schedule
- Fixed-predictor Track A submission with no TTT or eval-time adaptation
- GPTQ int6 quantization with zstd-22 packaging
- Compressed-code packaging to fit under the 16 MB artifact cap