PR #1370

open

Non-record: 10L Gated DeltaNet (PureGDN) — val_bpb 1.003028 (3-seed mean, legal TTT)

by Christopher-Lee-McClendon
val_bpb: 1.0030
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.17 MB

Training Techniques

Architecture
Gated DeltaNet
Replaced softmax attention with Gated DeltaNet linear attention in all 10 layers.
parameters: {"layers":10,"dim":512,"heads":1,"expand_k":1,"expand_v":2}
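For readers unfamiliar with the mechanism, here is a minimal single-step NumPy sketch of the gated delta rule, assuming the common recurrence S_t = α_t S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with L2-normalized keys. The submission's actual chunked/parallel kernel, gate parameterization, and expand_k/expand_v projections are not shown:

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule (illustrative sketch).

    S     : (d_v, d_k) fast-weight state
    q, k  : (d_k,) query / key (k assumed L2-normalized)
    v     : (d_v,) value
    alpha : scalar in (0, 1), data-dependent decay gate
    beta  : scalar in (0, 1), write strength
    """
    # Decay the state, erase what was stored under key k, then write v under k.
    S = alpha * S - beta * np.outer(alpha * S @ k, k) + beta * np.outer(v, k)
    # Read out with the query against the updated fast weights.
    return S, S @ q
```

With alpha = beta = 1 and a fresh state, querying with the same key retrieves the stored value exactly, which is the delta-rule associative-memory view of this layer.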
BigramHash
Used BigramHash embeddings with a trigram extension for cheap n-gram context.
parameters: {"vocab":3072,"dim":112}
TrigramHash
Added an additive trigram hash channel on top of BigramHash features.
parameters: null
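A sketch of the hashed n-gram embedding idea: bigrams and trigrams are hashed into a small shared table and summed. The hash constants, table layout, and whether the channels share one table are assumptions for illustration, not the submission's exact scheme:

```python
import numpy as np

def ngram_hash_embed(token_ids, table, vocab=3072):
    """Bigram-hash embedding with an additive trigram channel (sketch).

    token_ids : sequence of ints (token stream)
    table     : (vocab, dim) embedding table, shared here by both channels
    """
    dim = table.shape[1]
    out = np.zeros((len(token_ids), dim))
    for t, tok in enumerate(token_ids):
        prev = token_ids[t - 1] if t > 0 else 0
        prev2 = token_ids[t - 2] if t > 1 else 0
        # Cheap multiplicative hashes of the (prev, tok) bigram and
        # (prev2, prev, tok) trigram into the small shared vocab.
        h2 = (prev * 1000003 + tok) % vocab
        h3 = (prev2 * 999979 + prev * 1000003 + tok) % vocab
        out[t] = table[h2] + table[h3]  # additive trigram channel
    return out
```

The appeal is cost: n-gram context for the price of two table lookups per token, with the 3072×112 table fitting easily under the artifact cap.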
weight tying
Tied input and output embeddings.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.95,"start_step":3500}
SWA
parameters: {"checkpoints":12,"start_step":6450}
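The two averaging schemes reduce to a few lines each; a sketch with weights as plain dicts of arrays (the real run applies these per-parameter-tensor after start_step, as listed above):

```python
import numpy as np

def ema_update(ema, params, decay=0.95):
    # EMA: blend the current weights into a running average each step.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}

def swa_average(checkpoints):
    # SWA: plain mean over saved checkpoints (12 in this entry).
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}
```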
Quantization
late QAT
parameters: {"bits":6,"scope":"model"}
GPTQ
parameters: {"bits":6,"scope":"model"}
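To make the 6-bit budget concrete, here is a symmetric per-tensor int6 fake-quantization round trip. This is only the simplest possible scheme; the entry's late QAT (quantizing during the tail of training) and GPTQ (error-compensating per-column rounding) are considerably more elaborate:

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake quantization (illustrative sketch)."""
    qmax = 2 ** (6 - 1) - 1          # 31 for signed int6
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    # Round to the int6 grid, then map back to float ("fake" quantization).
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```

Six bits gives at most 64 distinct levels per tensor, which is where most of the artifact-size saving comes from before zstd sees the weights.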
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","momentum":0.9,"learning_rate":0.002,"epochs_per_chunk":3,"chunk_size":32768,"freeze_first_blocks":2}
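The "score-first" ordering is the part that keeps the TTT legal: each chunk is scored before any weight update sees it, so the reported loss never benefits from adaptation to that chunk. A minimal loop sketch, with the model abstracted behind two hypothetical callbacks:

```python
def score_first_ttt(model_loss, model_step, chunks, lr=0.002,
                    epochs_per_chunk=3):
    """Score-first test-time training loop (sketch).

    model_loss(chunk) -> float : evaluates loss on a chunk with current weights
    model_step(chunk, lr)      : one SGD(+momentum) update on a chunk
    """
    total = 0.0
    for chunk in chunks:
        total += model_loss(chunk)           # score FIRST, with unadapted weights
        for _ in range(epochs_per_chunk):    # then adapt on the same chunk
            model_step(chunk, lr)
    return total / len(chunks)
```

In the actual submission, chunks are 32768 tokens, the first 2 blocks stay frozen, and the inner updates use SGD with momentum 0.9 at lr 0.002; none of that machinery is shown here.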
Sequence Length
sequence_length
train_length: 1024
eval_length: 32768
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
LR Schedule
cosine decay
parameters: null
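Since the schedule's parameters are unspecified, here is the generic cosine-decay form for reference; the run's warmup, endpoints, and total step count are not stated in this PR:

```python
import math

def cosine_lr(step, total_steps, base_lr, final_lr=0.0):
    """Cosine decay from base_lr to final_lr over total_steps (generic form)."""
    t = min(step / total_steps, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * t))
```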

Novel Contributions

  • Replaced softmax attention with Gated DeltaNet (PureGDN) across all 10 layers.
  • Achieved a 3-seed-mean val_bpb of 1.003028 under the 16 MB artifact cap.
  • Added a trigram hash extension on top of BigramHash embeddings.
  • Demonstrated legal score-first TTT with SGD momentum on a linear-attention backbone.
  • Combined EMA, SWA, late QAT, GPTQ int6, and zstd-22 compression in one submission.