PR #875 (open)

New Record: Pure Neural GDN 1.0226 BPB (shalyhinpavel)

by shalyhinpavel
val_bpb: 1.0226
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 14.1 MB

Training Techniques

Architecture
  • Gated Attention: replaced standard attention with GatedDeltaBlock / Gated DeltaNet layers; described as 8 DeltaNet layers plus a final standard attention layer.
    parameters: {"layers": 8, "final_attention_layer": 1, "n_embd": 384}
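The recurrence these layers implement can be sketched per token (a minimal numpy sketch of the gated delta rule; the real GatedDeltaBlock runs multi-head, chunk-parallel kernels with learned per-token gates):

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One token step of the gated delta rule (sketch).

    S:     (d_v, d_k) fast-weight state
    q, k:  (d_k,) query / key (k assumed L2-normalized)
    v:     (d_v,) value
    alpha: scalar decay gate in [0, 1]
    beta:  scalar write-strength gate in [0, 1]
    """
    # Decay the state, erase the old value bound to k, write the new one:
    # S_t = S_{t-1} * (alpha * (I - beta * k k^T)) + beta * v k^T
    S = alpha * S - beta * np.outer(alpha * S @ k, k) + beta * np.outer(v, k)
    o = S @ q  # read-out for this token
    return S, o
```

With alpha = beta = 1 and a unit key, writing (k, v) and then querying with q = k returns v exactly, which is the associative-recall behavior the delta rule is built around.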
  • Weight tying: standard embedding and lm_head tying.
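Tying means the token embedding matrix is reused as the output projection, halving those parameters; a minimal sketch (the vocabulary size here is illustrative, only n_embd = 384 comes from the record):

```python
import numpy as np

n_vocab, n_embd = 1000, 384  # n_embd matches the record; vocab size is illustrative
rng = np.random.default_rng(0)

# A single shared matrix acts as both token embedding and LM head.
W = rng.normal(0.0, 0.02, size=(n_vocab, n_embd))

tokens = np.array([1, 7, 42])
h = W[tokens]     # embedding lookup: (3, n_embd)
logits = h @ W.T  # output projection reuses the same weights: (3, n_vocab)
```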
Quantization
  • int8 (bits: 8, scope: all)
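Symmetric per-tensor quantization is one common way to get int8 weights; a sketch of the round trip (the PR's exact quantizer is not specified):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (sketch; the PR's scheme may differ)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```

The reconstruction error is bounded by half a quantization step (scale / 2) per weight.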
Compression
  • zlib (level: not specified)
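Serialized int8 weights can then be entropy-coded with zlib for the final artifact; a sketch of the pack/unpack round trip (the compression level is an assumption, since the PR leaves it unspecified):

```python
import zlib
import numpy as np

# Quantized weights cluster around zero, so their byte stream compresses well.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(0.0, 10.0, size=100_000)), -127, 127).astype(np.int8)

packed = zlib.compress(q.tobytes(), level=9)  # level=9 is an assumption
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```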
Optimizer
  • AdamW (weight_decay and momentum not specified; other_params: {"fused": true})
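A decoupled-weight-decay Adam update looks like this in a minimal numpy sketch (the hyperparameters shown are placeholders; "fused" refers to PyTorch's fused AdamW kernel, and the PR's actual settings are not stated):

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=3e-4, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    """One AdamW step (sketch). t is the 1-based step count."""
    m = betas[0] * m + (1 - betas[0]) * g          # first-moment EMA
    v = betas[1] * v + (1 - betas[1]) * g * g      # second-moment EMA
    m_hat = m / (1 - betas[0] ** t)                # bias correction
    v_hat = v / (1 - betas[1] ** t)
    # Weight decay is decoupled: applied directly to p, not folded into g.
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```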
Other
  • Dynamic batch size and chunk size curriculum based on elapsed time; global batch schedule 64 -> 128 -> 192.
    parameters: {"global_batch_schedule": [64, 128, 192]}
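A time-based version of that schedule can be sketched as follows (the switch points are assumptions; the PR states the 64 -> 128 -> 192 plan but not when it steps up):

```python
def batch_size_for(elapsed_frac, schedule=(64, 128, 192), switch_points=(1/3, 2/3)):
    """Pick the global batch size from the fraction of training time elapsed.

    `schedule` follows the PR's 64 -> 128 -> 192 plan; `switch_points`
    are placeholder thresholds, not values taken from the PR.
    """
    if elapsed_frac < switch_points[0]:
        return schedule[0]
    if elapsed_frac < switch_points[1]:
        return schedule[1]
    return schedule[2]
```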
  • FastLoader with non-blocking prefetching and pin_memory to reduce dataloader bottlenecks.
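The core idea of such a loader can be sketched with a background thread and a bounded queue (a stdlib-only sketch; the real FastLoader additionally pins host memory and issues non_blocking GPU copies):

```python
import queue
import threading

class PrefetchLoader:
    """Fetch batches on a background thread so compute never waits (sketch).

    Buffers up to `depth` batches ahead of the consumer. Assumes batches
    are never None, which is reserved as the end-of-stream sentinel.
    """

    def __init__(self, batches, depth=2):
        self._q = queue.Queue(maxsize=depth)
        self._t = threading.Thread(target=self._fill, args=(batches,), daemon=True)
        self._t.start()

    def _fill(self, batches):
        for b in batches:
            self._q.put(b)  # blocks once `depth` batches are buffered
        self._q.put(None)   # sentinel: no more batches

    def __iter__(self):
        while (b := self._q.get()) is not None:
            yield b
```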
  • Strict allow_tf32 enforcement for hardware throughput optimization.
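In PyTorch, enforcing TF32 is typically this pair of settings, trading a little fp32 matmul precision for Ampere+ tensor-core throughput (shown as an assumption about the PR's code, which is not quoted here):

```python
import torch

# Allow TF32 tensor-core math on Ampere and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```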

Novel Contributions

  • Pure neural Gated DeltaNet baseline without TTT or external cache
  • Dynamic batch size and chunk size curriculum based on elapsed time
  • FastLoader with non-blocking prefetching and pinned memory
  • Int8-compressed sub-16MB artifact with 3-seed mean validation BPB of 1.0226