PR #1711

open

Record: GatedDeltaNet FLA + Score-First TTT + Brotli — val_bpb 1.00980 (3-seed mean)

val_bpb

1.0098
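For readers unfamiliar with the metric, the sketch below shows how bits per byte is commonly computed: total negative log-likelihood converted from nats to bits, divided by the number of bytes in the evaluated text. It assumes the usual definition; the function names are hypothetical and the PR's own evaluation code may differ in detail.

```python
import math


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / math.log(2) / n_bytes


def mean_over_seeds(per_seed_bpb):
    """Plain arithmetic mean, as in a '3-seed mean'."""
    return sum(per_seed_bpb) / len(per_seed_bpb)
```

Lower is better: a model that spends exactly one bit on every byte would score 1.0.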

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.6 MB

Training Techniques

Architecture

GatedDeltaNet

GatedDeltaNet with Flash Linear Attention (K_KVShare_Wider), replacing softmax attention with linear attention whose cost is O(n) in sequence length.

parameters: {"layers":10,"dimensions":544,"heads":8}
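To illustrate why this layer is O(n), here is a minimal single-head sketch of the gated delta rule recurrence in plain Python. It assumes the standard formulation (decay the memory by a gate alpha, then correct the value stored under the current key by a fraction beta); the function names are hypothetical, and the fused FLA kernels used in the PR compute this in chunked, parallel form rather than step by step.

```python
def gated_delta_step(state, q, k, v, alpha, beta):
    """One recurrent step for a single head.

    state: d_v x d_k matrix (list of rows) mapping keys to values.
    alpha: decay gate in [0, 1]; beta: write strength in [0, 1].
    """
    # Decay the old memory.
    decayed = [[alpha * s for s in row] for row in state]
    # What the decayed memory currently returns for this key.
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]
    # Delta rule: move the value stored under k toward v by a fraction beta.
    new_state = [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(decayed, v, pred)
    ]
    # Read out with the query.
    out = [sum(s * qj for s, qj in zip(row, q)) for row in new_state]
    return new_state, out


def run_sequence(qs, ks, vs, alphas, betas, d_k, d_v):
    """Process a sequence with a fixed-size state: cost grows linearly in length."""
    state = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        state, out = gated_delta_step(state, q, k, v, a, b)
        outputs.append(out)
    return outputs
```

The state has a fixed d_v x d_k size regardless of sequence length, so each token costs the same amount of work, unlike softmax attention, where each token attends over all previous ones.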

Compression

Brotli

level: 11

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"freeze_blocks":2}
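The "score-first" ordering can be sketched as follows: each evaluation chunk is scored with weights that have not yet seen it, and only afterwards used for adaptation, so the reported loss never benefits from training on the same tokens. This is a toy illustration, not the PR's code: `ToyModel` is a hypothetical one-parameter stand-in for the network, the momentum update follows the common `v = mu * v + grad` convention as an assumption, and `freeze_blocks` is not modeled.

```python
class ToyModel:
    """Hypothetical stand-in for the network: predicts one constant w."""

    def __init__(self, lr=0.005, momentum=0.9):
        self.w = 0.0
        self.velocity = 0.0
        self.lr = lr
        self.momentum = momentum

    def loss(self, chunk):
        # Mean squared error of the constant prediction on this chunk.
        return sum((x - self.w) ** 2 for x in chunk) / len(chunk)

    def sgd_step(self, chunk):
        # SGD with momentum on the mean squared error.
        grad = sum(2.0 * (self.w - x) for x in chunk) / len(chunk)
        self.velocity = self.momentum * self.velocity + grad
        self.w -= self.lr * self.velocity


def score_first_ttt(model, chunks, epochs=3):
    scores = []
    for chunk in chunks:
        # 1) Score first: record the loss before any update on this chunk.
        scores.append(model.loss(chunk))
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs):
            model.sgd_step(chunk)
    return scores
```

With this ordering, later chunks benefit from adaptation on earlier ones, while each chunk's own score stays a genuine held-out measurement.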

Optimizer

SGD

weight_decay: null

momentum: 0.9

other_params: {"learning_rate":0.005}

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
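The two averaging schemes named here can be sketched with the listed parameters: an exponential moving average updated every step with decay 0.997, and a stochastic weight average that adds a snapshot every 50 steps. How the PR combines the two averages for the final checkpoint is not stated, so the class below (a hypothetical name) simply tracks both side by side.

```python
class WeightAverager:
    """Tracks an EMA and an SWA of a flat list of weights."""

    def __init__(self, weights, ema_decay=0.997, swa_every=50):
        self.ema = list(weights)
        self.swa_sum = [0.0] * len(weights)
        self.swa_count = 0
        self.ema_decay = ema_decay
        self.swa_every = swa_every
        self.step = 0

    def update(self, weights):
        self.step += 1
        d = self.ema_decay
        # EMA: blend a small fraction of the current weights in on every step.
        self.ema = [d * e + (1.0 - d) * w for e, w in zip(self.ema, weights)]
        # SWA: accumulate an equally weighted snapshot every swa_every steps.
        if self.step % self.swa_every == 0:
            self.swa_sum = [s + w for s, w in zip(self.swa_sum, weights)]
            self.swa_count += 1

    def swa(self):
        if self.swa_count == 0:
            return None  # no snapshot taken yet
        return [s / self.swa_count for s in self.swa_sum]
```

A decay of 0.997 gives the EMA an effective window of roughly 1 / (1 - 0.997), or about 333 steps.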

Quantization

late QAT

bits: 6

scope: matrices and embeddings
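As a rough illustration of 6-bit quantization, the sketch below applies symmetric fake quantization to one row of weights: values are rounded to one of 63 integer codes in [-31, 31] and mapped back to floats. The per-row scale and the symmetric range are assumptions; the PR does not specify its exact scheme. In quantization-aware training, the forward pass uses such rounded weights so the model learns to tolerate the rounding; "late" suggests this is switched on only toward the end of training.

```python
def fake_quantize(row, bits=6):
    """Round a row of floats to a symmetric integer grid and map back.

    Returns (dequantized_row, scale).
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(x) for x in row)
    if max_abs == 0.0:
        return list(row), 1.0  # all-zero row: nothing to quantize
    scale = max_abs / qmax
    # Integer codes, clipped to the representable range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return [c * scale for c in codes], scale
```

The rounding error per weight is at most half a quantization step, that is, `scale / 2`.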

LR Schedule

cosine decay

parameters: null
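Since no schedule parameters are recorded, the following is only a generic sketch of cosine decay; `base_lr`, `min_lr`, and `total_steps` are hypothetical arguments, and any warmup phase the PR may use is not shown.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate that falls from base_lr to min_lr along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, passes the midpoint between `base_lr` and `min_lr` halfway through training, and flattens out as it approaches `min_lr`.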

Regularization

weight decay

parameters: null