PR #875 (open)

New Record: Pure Neural GDN 1.0226 BPB (shalyhinpavel)

by shalyhinpavel
val_bpb: 1.0226
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 14.1 MB

Training Techniques

Architecture
  • Gated Attention: replaced standard attention with GatedDeltaBlock / Gated DeltaNet layers; described as 8 DeltaNet layers plus a final standard attention layer.
    parameters: {"layers": 8, "final_attention_layer": 1, "n_embd": 384}
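The recurrence these layers implement can be sketched per token (a minimal numpy sketch of the gated delta rule; the real GatedDeltaBlock runs multi-head, chunk-parallel kernels with learned per-token gates):

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One token step of the gated delta rule (sketch).

    S:     (d_v, d_k) fast-weight state
    q, k:  (d_k,) query / key (k assumed L2-normalized)
    v:     (d_v,) value
    alpha: scalar decay gate in [0, 1]
    beta:  scalar write-strength gate in [0, 1]
    """
    # Decay the state, erase the old value bound to k, write the new one:
    # S_t = S_{t-1} * (alpha * (I - beta * k k^T)) + beta * v k^T
    S = alpha * S - beta * np.outer(alpha * S @ k, k) + beta * np.outer(v, k)
    o = S @ q  # read-out for this token
    return S, o
```

With alpha = beta = 1 and a unit key, writing (k, v) and then querying with q = k returns v exactly, which is the associative-recall behavior the delta rule is built around.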
  • Weight tying: standard embedding and lm_head tying.
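Tying means the token embedding matrix is reused as the output projection, halving those parameters; a minimal sketch (the vocabulary size here is illustrative, only n_embd = 384 comes from the record):

```python
import numpy as np

n_vocab, n_embd = 1000, 384  # n_embd matches the record; vocab size is illustrative
rng = np.random.default_rng(0)

# A single shared matrix acts as both token embedding and LM head.
W = rng.normal(0.0, 0.02, size=(n_vocab, n_embd))

tokens = np.array([1, 7, 42])
h = W[tokens]     # embedding lookup: (3, n_embd)
logits = h @ W.T  # output projection reuses the same weights: (3, n_vocab)
```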
Quantization
  • int8 (bits: 8, scope: all)
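Symmetric per-tensor quantization is one common way to get int8 weights; a sketch of the round trip (the PR's exact quantizer is not specified):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (sketch; the PR's scheme may differ)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```

The reconstruction error is bounded by half a quantization step (scale / 2) per weight.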
Compression
  • zlib (level: not specified)
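Serialized int8 weights can then be entropy-coded with zlib for the final artifact; a sketch of the pack/unpack round trip (the compression level is an assumption, since the PR leaves it unspecified):

```python
import zlib
import numpy as np

# Quantized weights cluster around zero, so their byte stream compresses well.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(0.0, 10.0, size=100_000)), -127, 127).astype(np.int8)

packed = zlib.compress(q.tobytes(), level=9)  # level=9 is an assumption
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```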
Optimizer
  • AdamW (weight_decay and momentum not specified; other_params: {"fused": true})
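A decoupled-weight-decay Adam update looks like this in a minimal numpy sketch (the hyperparameters shown are placeholders; "fused" refers to PyTorch's fused AdamW kernel, and the PR's actual settings are not stated):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=3e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    """One AdamW step (sketch). t is the 1-based step count."""
    m = betas[0] * m + (1 - betas[0]) * g          # first-moment EMA
    v = betas[1] * v + (1 - betas[1]) * g * g      # second-moment EMA
    m_hat = m / (1 - betas[0] ** t)                # bias correction
    v_hat = v / (1 - betas[1] ** t)
    # Weight decay is decoupled: applied directly to p, not folded into g.
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```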
Other
  • Dynamic batch size and chunk size curriculum based on elapsed time; global batch schedule 64 -> 128 -> 192.
    parameters: {"global_batch_schedule": [64, 128, 192]}
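A time-based version of that schedule can be sketched as follows (the switch points are assumptions; the PR states the 64 -> 128 -> 192 plan but not when it steps up):

```python
def batch_size_for(elapsed_frac, schedule=(64, 128, 192), switch_points=(1/3, 2/3)):
    """Pick the global batch size from the fraction of training time elapsed.

    `schedule` follows the PR's 64 -> 128 -> 192 plan; `switch_points`
    are placeholder thresholds, not values taken from the PR.
    """
    if elapsed_frac < switch_points[0]:
        return schedule[0]
    if elapsed_frac < switch_points[1]:
        return schedule[1]
    return schedule[2]
```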
  • FastLoader with non-blocking prefetching and pin_memory to reduce dataloader bottlenecks.
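The core idea of such a loader can be sketched with a background thread and a bounded queue (a stdlib-only sketch; the real FastLoader additionally pins host memory and issues non_blocking GPU copies):

```python
import queue
import threading

class PrefetchLoader:
    """Fetch batches on a background thread so compute never waits (sketch).

    Buffers up to `depth` batches ahead of the consumer. Assumes batches
    are never None, which is reserved as the end-of-stream sentinel.
    """

    def __init__(self, batches, depth=2):
        self._q = queue.Queue(maxsize=depth)
        self._t = threading.Thread(target=self._fill, args=(batches,), daemon=True)
        self._t.start()

    def _fill(self, batches):
        for b in batches:
            self._q.put(b)  # blocks once `depth` batches are buffered
        self._q.put(None)   # sentinel: no more batches

    def __iter__(self):
        while (b := self._q.get()) is not None:
            yield b
```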
  • Strict allow_tf32 enforcement for hardware throughput optimization.
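In PyTorch, enforcing TF32 is typically this pair of settings, trading a little fp32 matmul precision for Ampere+ tensor-core throughput (shown as an assumption about the PR's code, which is not quoted here):

```python
import torch

# Allow TF32 tensor-core math on Ampere and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```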

Novel Contributions

  • Pure neural Gated DeltaNet baseline without TTT or external cache
  • Dynamic batch size and chunk size curriculum based on elapsed time
  • FastLoader with non-blocking prefetching and pinned memory
  • Int8-compressed sub-16MB artifact with 3-seed mean validation BPB of 1.0226