PR #1575
Record: GDN-Hybrid + Sliding Window Attention + compressed-code warmdown1000 (cold-cache, 1.01671 BPB)
by joshkmartinez
val_bpb: 1.0167
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,903,365 bytes
Training Techniques
Architecture
sliding window eval
GDN-hybrid backbone with a sliding-window attention side path (SWA_shared structure).
parameters: null
GDN
GDN-hybrid backbone with repeated GDN blocks and sliding-window attention.
parameters: {"layers":5}
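The sliding-window attention path restricts each query to a fixed-size causal window of recent keys. A minimal sketch of that masking rule (function name and window size are illustrative, not taken from the record's code):

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    # query i may attend key j only when j <= i (causal) and i - j < window
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# each row has at most `window` True entries, always including the diagonal
```

This keeps per-token attention cost O(window) instead of O(seq_len), which is the usual motivation for pairing SWA layers with a linear-time backbone like GDN.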
Quantization
GPTQ
bits: 6
scope: all
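GPTQ quantizes weights column-by-column with a Hessian-based correction of accumulated rounding error; that machinery is beyond a short sketch, but the signed 6-bit grid it targets can be illustrated with plain round-to-nearest (a simplified stand-in, not GPTQ itself; per-row scaling is an assumption):

```python
import numpy as np

def quantize_rtn_6bit(w: np.ndarray, bits: int = 6):
    # symmetric per-row scale onto a signed 6-bit grid [-32, 31];
    # GPTQ additionally compensates rounding error with second-order info
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_rtn_6bit(w)
w_hat = q * s                                       # dequantized weights
```

At 6 bits the worst-case per-element error of this stand-in is half a quantization step (scale / 2) per row.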
Weight Averaging
EMA
parameters: {"decay":0.997}
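EMA weight averaging keeps a slow-moving copy of the parameters that is updated after every optimizer step; with decay 0.997 the average effectively tracks the last few hundred steps. A minimal sketch, using a list of floats as a stand-in for real parameter tensors:

```python
def ema_update(avg: list[float], new: list[float], decay: float = 0.997) -> list[float]:
    # avg <- decay * avg + (1 - decay) * new, elementwise
    return [decay * a + (1 - decay) * n for a, n in zip(avg, new)]

avg = [0.0]
for _ in range(10):
    avg = ema_update(avg, [1.0])        # after k steps: 1 - 0.997**k
```

Evaluation then uses the averaged weights rather than the raw final-step weights.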
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"mix":"MuonEq-R + AdamW"}
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
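The warmdown schedule holds the base learning rate constant and then decays it linearly to zero over the final warmdown_steps (1000 here). A minimal sketch (function name illustrative):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 1000) -> float:
    # constant LR until the last warmdown_steps, then linear decay to 0
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return base_lr
    return base_lr * remaining / warmdown_steps
```

For a 5000-step run, the LR stays at base_lr through step 4000 and reaches zero at step 5000.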
Compression
zstd
level: 22
Other
other
Compressed-code packaging for train_gpt.py, architectures.py, and configs.py to recover artifact-size headroom.
parameters: null
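The packaging step archives the three training-code files and compresses the archive to claw back artifact-size headroom. The record uses zstd at level 22; this sketch substitutes stdlib lzma so it runs without the third-party zstandard bindings, and the 16 MiB cap is an assumption based on the sub-16 MB legality rule:

```python
import io
import lzma
import tarfile

SIZE_CAP = 16 * 1024 * 1024  # assumed 16 MiB artifact cap

def pack_sources(named_blobs: dict[str, bytes]) -> bytes:
    """Tar the source files, then compress the archive.

    Stand-in: the record compresses with zstd level 22; lzma is used
    here only so the sketch is stdlib-runnable.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in named_blobs.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    blob = lzma.compress(buf.getvalue(), preset=9)
    assert len(blob) <= SIZE_CAP, "artifact exceeds size cap"
    return blob

artifact = pack_sources({
    "train_gpt.py": b"print('train')\n",
    "architectures.py": b"# GDN blocks\n",
    "configs.py": b"warmdown_steps = 1000\n",
})
```

The same size check applied to the real artifact (15,903,365 bytes) confirms it fits under the cap.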
Novel Contributions
- Repaired compressed-code warmdown1000 bundle for the GDN-Hybrid family
- Cold-cache 3-seed confirmation run with authoritative pulled TensorPool artifacts
- Improved mean quantized BPB over PR #1564 while reducing max artifact size
- SAFE_SUBMISSION fixed-predictor Track-A legality lane with all artifacts under 16 MB