PR #1698

open

Record: GatedDeltaNet (FLA) + Legal Score-First TTT — val_bpb 1.00995 (3-seed mean)

val_bpb

1.00995

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.83 MiB

Training Techniques

Architecture

GatedDeltaNet

Replaces softmax attention with the linear-time gated delta rule recurrence from Flash Linear Attention (FLA).

parameters: {"layers":10,"dimensions":544,"heads":8,"kv_sharing_stride":2}
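The gated delta rule named above can be illustrated with a naive per-token reference. This is a minimal sketch for a single head, assuming the standard formulation S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. The function name, argument layout, and key normalization are assumptions for illustration; the PR itself relies on the fused FLA kernels, which are not reproduced here.

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive O(T) recurrence for one head (illustrative, not the FLA kernel).

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in [0, 1].
    The state S has shape (d_v, d_k) and maps keys to values.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        # Assumption: unit-norm keys, which keeps the erase step well behaved.
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-6)
        # Decay the old state and erase what it currently stores under kt ...
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt))
        # ... then write the new value under the same key.
        S = S + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

With alpha = 1 and beta = 1, writing twice under the same key overwrites the earlier value instead of accumulating it, which is the behavior that separates the delta rule from plain linear attention.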

K_KVShare_Wider

Wider architecture with KV sharing and 10-layer configuration built on PR #1687.

parameters: {"layers":10,"dimensions":544,"heads":8,"kv_sharing_stride":2}
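The card does not spell out what `kv_sharing_stride` means. One plausible reading, shown here purely as an assumption, is that every group of `stride` consecutive layers reuses the key/value projection of the first layer in the group. The helper name `kv_owner` is hypothetical.

```python
def kv_owner(layer: int, stride: int = 2) -> int:
    """Index of the layer whose K/V projection `layer` reuses (assumed scheme)."""
    return (layer // stride) * stride

# For the 10-layer configuration listed above:
owners = [kv_owner(i) for i in range(10)]
# owners == [0, 0, 2, 2, 4, 4, 6, 6, 8, 8]
```

Under this reading, 10 layers with stride 2 hold 5 distinct K/V projections, which would free parameter budget for the wider 544-dimensional model.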

Quantization

mixed int6/int8

bits: 6

scope: matrices and embeddings
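To make the int6 entry concrete, here is a minimal sketch of symmetric per-row quantization. The per-row scaling, the int8 storage container, and the function names are assumptions; the card only states the bit width and the scope, not the PR's actual scheme or how int6 and int8 are split across tensors.

```python
import numpy as np

def quantize_rows(w, bits=6):
    """Symmetric per-row quantization; int6 values are stored in int8 containers."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_rows(q, scale):
    """Recover an approximate float matrix from codes and per-row scales."""
    return q.astype(np.float32) * scale
```

The reconstruction error per element is bounded by half a quantization step for that row, so rows with small dynamic range lose less precision.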

late QAT

bits: null

scope: all

Compression

zstd

level: 22

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"adam_for_scalars_embeds":true}
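The card gives no Muon hyperparameters beyond the Adam fallback for scalars and embeddings. As background, the core of Muon is an approximate orthogonalization of the momentum matrix by a Newton-Schulz iteration. The sketch below follows the quintic iteration and coefficients of the public Muon reference code; the function name and the NumPy port are illustrative, and nothing here is taken from the PR's own optimizer.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Push the singular values of a 2-D matrix toward 1 without an SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic coefficients from the reference code
    x = g / (np.linalg.norm(g) + eps)     # Frobenius norm bounds the spectral norm by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                        # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x
```

The result is only roughly orthogonal: singular values land in a band around 1 rather than exactly at 1, which the reference implementation treats as sufficient for the update direction.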

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
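The two averaging schemes listed above can be sketched on a flat list of weights. The values `decay=0.997` and `every=50` come from the card; how the PR combines the EMA and SWA weights at export time is not stated, so the sketch keeps them as two independent trackers. The names `ema_update` and `SWA` are hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """In-place exponential moving average over a flat list of weights."""
    for i, w in enumerate(weights):
        ema[i] = decay * ema[i] + (1.0 - decay) * w

class SWA:
    """Uniform running mean over snapshots taken every `every` steps."""

    def __init__(self, size, every=50):
        self.mean = [0.0] * size
        self.count = 0
        self.every = every

    def maybe_update(self, step, weights):
        if step % self.every != 0:
            return
        self.count += 1
        for i, w in enumerate(weights):
            # Incremental mean: no need to store past snapshots.
            self.mean[i] += (w - self.mean[i]) / self.count
```

A decay of 0.997 gives the EMA an effective horizon of roughly 1 / (1 - 0.997), about 333 steps, while SWA weights every sampled snapshot equally.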

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2,"momentum":0.9}
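The ordering constraint behind "score-first" is easiest to see in code: each chunk is scored with weights that have only been adapted on earlier chunks, and only afterwards used for training. The sketch below uses a scalar toy model and hypothetical callbacks `loss_fn` and `grad_fn`; the defaults mirror the card's learning rate, epoch count, and momentum, while block freezing and the 32768-token chunking are omitted.

```python
def score_first_ttt(chunks, w, loss_fn, grad_fn, lr=0.005, epochs=3, momentum=0.9):
    """Score each chunk before adapting on it; returns (summed loss, final weight)."""
    total_loss = 0.0
    velocity = 0.0
    for chunk in chunks:
        # 1) Score first: the chunk is evaluated before any update that has seen it.
        total_loss += loss_fn(w, chunk)
        # 2) Then adapt on the chunk that was just scored (SGD with momentum).
        for _ in range(epochs):
            velocity = momentum * velocity + grad_fn(w, chunk)
            w = w - lr * velocity
    return total_loss, w

# Toy model (assumption): predict a constant w, squared-error loss.
def mse(w, xs):
    return sum((w - x) ** 2 for x in xs) / len(xs)

def mse_grad(w, xs):
    return sum(2.0 * (w - x) for x in xs) / len(xs)
```

Because the reported loss for a chunk never depends on gradients computed from that chunk, the evaluation does not leak validation targets into the score, which is presumably what the title means by "legal".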

Evaluation

sliding window eval

parameters: {"chunk_tokens":32768}

LR Schedule

cosine decay

parameters: {"applied_to":"TTT learning rate"}
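A cosine decay of the TTT learning rate can be sketched as follows. The base rate 0.005 is the TTT learning rate from the card; whether `step` counts chunks or optimizer steps, and whether the schedule ends at zero, are assumptions, as is the function name.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.005, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / max(total_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Halfway through the schedule the rate is exactly midway between `base_lr` and `min_lr`, and steps past `total_steps` are clamped to `min_lr`.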

Uses GatedDeltaNet linear attention from Flash Linear Attention instead of softmax attention.
Applies legal score-first test-time training with 3-epoch SGD adaptation on previously scored chunks.
Builds on the K_KVShare_Wider architecture from PR #1687.
Achieves a 3-seed mean validation BPB of 1.00995 with sub-16 MiB artifacts.
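For reference, bits per byte (BPB) is the summed negative log-likelihood converted from nats to bits and divided by the number of raw bytes in the evaluated text, which makes it independent of the tokenizer. A minimal sketch, with a hypothetical function name and synthetic numbers in the usage line:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats over a text of total_bytes bytes to BPB."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Synthetic check: ln(2) nats per byte is exactly one bit per byte.
example = bits_per_byte(math.log(2) * 1000, 1000)
```

The reported 3-seed figure would then be the arithmetic mean of this quantity over three training runs with different seeds.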