PR #1370

open

Non-record: 10L Gated DeltaNet (PureGDN) — val_bpb 1.003028 (3-seed mean, legal TTT)

by Christopher-Lee-McClendon
val_bpb: 1.0030
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.17 MB

Training Techniques

Architecture
Gated DeltaNet
Replaced softmax attention with Gated DeltaNet linear attention in all 10 layers.
parameters: {"layers":10,"dim":512,"heads":1,"expand_k":1,"expand_v":2}
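For readers unfamiliar with the mechanism, here is a minimal single-step NumPy sketch of the gated delta rule, assuming the common recurrence S_t = α_t S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with L2-normalized keys. The submission's actual chunked/parallel kernel, gate parameterization, and expand_k/expand_v projections are not shown:

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule (illustrative sketch).

    S     : (d_v, d_k) fast-weight state
    q, k  : (d_k,) query / key (k assumed L2-normalized)
    v     : (d_v,) value
    alpha : scalar in (0, 1), data-dependent decay gate
    beta  : scalar in (0, 1), write strength
    """
    # Decay the state, erase what was stored under key k, then write v under k.
    S = alpha * S - beta * np.outer(alpha * S @ k, k) + beta * np.outer(v, k)
    # Read out with the query against the updated fast weights.
    return S, S @ q
```

With alpha = beta = 1 and a fresh state, querying with the same key retrieves the stored value exactly, which is the delta-rule associative-memory view of this layer.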
BigramHash
Used BigramHash embeddings with a trigram extension for cheap n-gram context.
parameters: {"vocab":3072,"dim":112}
TrigramHash
Added an additive trigram hash channel on top of BigramHash features.
parameters: null
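A sketch of the hashed n-gram embedding idea: bigrams and trigrams are hashed into a small shared table and summed. The hash constants, table layout, and whether the channels share one table are assumptions for illustration, not the submission's exact scheme:

```python
import numpy as np

def ngram_hash_embed(token_ids, table, vocab=3072):
    """Bigram-hash embedding with an additive trigram channel (sketch).

    token_ids : sequence of ints (token stream)
    table     : (vocab, dim) embedding table, shared here by both channels
    """
    dim = table.shape[1]
    out = np.zeros((len(token_ids), dim))
    for t, tok in enumerate(token_ids):
        prev = token_ids[t - 1] if t > 0 else 0
        prev2 = token_ids[t - 2] if t > 1 else 0
        # Cheap multiplicative hashes of the (prev, tok) bigram and
        # (prev2, prev, tok) trigram into the small shared vocab.
        h2 = (prev * 1000003 + tok) % vocab
        h3 = (prev2 * 999979 + prev * 1000003 + tok) % vocab
        out[t] = table[h2] + table[h3]  # additive trigram channel
    return out
```

The appeal is cost: n-gram context for the price of two table lookups per token, with the 3072×112 table fitting easily under the artifact cap.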
weight tying
Tied input and output embeddings.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.95,"start_step":3500}
SWA
parameters: {"checkpoints":12,"start_step":6450}
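The two averaging schemes reduce to a few lines each; a sketch with weights as plain dicts of arrays (the real run applies these per-parameter-tensor after start_step, as listed above):

```python
import numpy as np

def ema_update(ema, params, decay=0.95):
    # EMA: blend the current weights into a running average each step.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}

def swa_average(checkpoints):
    # SWA: plain mean over saved checkpoints (12 in this entry).
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}
```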
Quantization
late QAT
parameters: {"bits":6,"scope":"model"}
GPTQ
parameters: {"bits":6,"scope":"model"}
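To make the 6-bit budget concrete, here is a symmetric per-tensor int6 fake-quantization round trip. This is only the simplest possible scheme; the entry's late QAT (quantizing during the tail of training) and GPTQ (error-compensating per-column rounding) are considerably more elaborate:

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake quantization (illustrative sketch)."""
    qmax = 2 ** (6 - 1) - 1          # 31 for signed int6
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    # Round to the int6 grid, then map back to float ("fake" quantization).
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Six bits gives at most 64 distinct levels per tensor, which is where most of the artifact-size saving comes from before zstd sees the weights.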
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs_per_chunk":3,"chunk_size":32768,"freeze_first_blocks":2}
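The "score-first" ordering is the part that keeps the TTT legal: each chunk is scored before any weight update sees it, so the reported loss never benefits from adaptation to that chunk. A minimal loop sketch, with the model abstracted behind two hypothetical callbacks:

```python
def score_first_ttt(model_loss, model_step, chunks, lr=0.002,
                    epochs_per_chunk=3):
    """Score-first test-time training loop (sketch).

    model_loss(chunk) -> float : evaluates loss on a chunk with current weights
    model_step(chunk, lr)      : one SGD(+momentum) update on a chunk
    """
    total = 0.0
    for chunk in chunks:
        total += model_loss(chunk)           # score FIRST, with unadapted weights
        for _ in range(epochs_per_chunk):    # then adapt on the same chunk
            model_step(chunk, lr)
    return total / len(chunks)
```

In the actual submission, chunks are 32768 tokens, the first 2 blocks stay frozen, and the inner updates use SGD with momentum 0.9 at lr 0.002; none of that machinery is shown here.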
Sequence Length
sequence_length
train_length: 1024
eval_length: 32768
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
LR Schedule
cosine decay
parameters: null
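Since the schedule's parameters are unspecified, here is the generic cosine-decay form for reference; the run's warmup, endpoints, and total step count are not stated in this PR:

```python
import math

def cosine_lr(step, total_steps, base_lr, final_lr=0.0):
    """Cosine decay from base_lr to final_lr over total_steps (generic form)."""
    t = min(step / total_steps, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * t))
```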

Novel Contributions

  • Replaced softmax attention with Gated DeltaNet (PureGDN) across all 10 layers.
  • Achieved a 3-seed-mean val_bpb of 1.003028 under the 16 MB artifact cap.
  • Added a trigram hash extension on top of BigramHash embeddings.
  • Demonstrated legal score-first TTT with SGD momentum on a linear-attention backbone.
  • Combined EMA, SWA, late QAT, GPTQ int6, and zstd-22 compression in one submission.