PR #1711

open

Record: GatedDeltaNet FLA + Score-First TTT + Brotli — val_bpb 1.00980 (3-seed mean)

by aamodbhatt
val_bpb
1.0098
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.6 MB

Training Techniques

Architecture
GatedDeltaNet
GatedDeltaNet with Flash Linear Attention (K_KVShare_Wider), replacing softmax attention with linear attention whose cost is O(n) in sequence length n.
parameters: {"layers":10,"dimensions":544,"heads":8}
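For orientation, the recurrence behind this kind of layer can be sketched in a few lines. This is a minimal pure-Python sketch of the gated delta rule as it is commonly written, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. It is not the FLA kernel used in this PR and does not model anything specific to K_KVShare_Wider; the function name and argument layout are illustrative assumptions.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S     : state matrix of shape d_v x d_k, as a list of rows
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : decay gate in [0, 1]
    beta  : write strength in [0, 1]
    """
    d_v, d_k = len(S), len(k)
    # What the memory currently returns for key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase the old association for k, decay the rest, then write v for k.
    S_new = [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query.
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out
```

The state has a fixed size regardless of how many tokens have been processed, so a full sequence costs one such step per token, which is where the O(n) scaling comes from.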
Compression
Brotli
level: 11
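A rough sketch of how a size budget might be checked after Brotli compression at level 11. It assumes the third-party `brotli` Python package and reads "16 MB" as 16,000,000 bytes; both the helper names and the byte interpretation are assumptions, not details taken from the PR.

```python
SIZE_LIMIT_BYTES = 16_000_000  # assumption: "16 MB" taken as decimal megabytes


def compress_artifact(raw: bytes, quality: int = 11) -> bytes:
    """Compress one artifact with Brotli (quality 11 is the maximum level)."""
    import brotli  # third-party package, imported lazily so the check below works without it
    return brotli.compress(raw, quality=quality)


def fits_budget(*blobs: bytes, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if the combined size of all blobs is strictly under the limit."""
    return sum(len(b) for b in blobs) < limit
```

A typical use would be `fits_budget(compress_artifact(weights_bytes), code_bytes)`, where `weights_bytes` and `code_bytes` are hypothetical names for the serialized model and the submission code.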
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"freeze_blocks":2}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005}
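The score-first ordering can be illustrated with a toy loop: each chunk is scored with the current weights before any update on that chunk, so no reported loss benefits from training on the tokens it measures. This is a sketch under assumptions: parameters are a flat list of floats, `loss_fn` and `grad_fn` are hypothetical callables, `epochs` is read as passes per chunk, and `frozen` stands in for the two frozen blocks as a set of parameter indices.

```python
def score_first_ttt(params, chunks, loss_fn, grad_fn,
                    lr=0.005, momentum=0.9, epochs=3, frozen=frozenset()):
    """Score each chunk first, then adapt on it with SGD plus momentum."""
    params = list(params)            # do not mutate the caller's weights
    velocity = [0.0] * len(params)
    total_loss = 0.0
    for chunk in chunks:
        # 1) Score with the weights as they are BEFORE seeing this chunk.
        total_loss += loss_fn(params, chunk)
        # 2) Only then train on the chunk that was just scored.
        for _ in range(epochs):
            grads = grad_fn(params, chunk)
            for i, g in enumerate(grads):
                if i in frozen:      # frozen parameters are never updated
                    continue
                velocity[i] = momentum * velocity[i] + g
                params[i] -= lr * velocity[i]
    return total_loss / len(chunks), params
```

The adapted weights only influence the scores of later chunks, which is the property that makes the protocol "score-first".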
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
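The two averages can be tracked side by side as below. The card does not say how the EMA and SWA weights are combined or when SWA starts, so this sketch only maintains both; the class name and the flat-list weight layout are illustrative assumptions.

```python
class WeightAverager:
    """Keeps an exponential moving average and a periodic snapshot average."""

    def __init__(self, weights, ema_decay=0.997, swa_every=50):
        self.ema_decay = ema_decay
        self.swa_every = swa_every
        self.ema = list(weights)          # EMA starts at the initial weights
        self.swa = [0.0] * len(weights)   # running mean of snapshots
        self.swa_count = 0
        self.step = 0

    def update(self, weights):
        """Call once per optimizer step with the current weights."""
        self.step += 1
        d = self.ema_decay
        self.ema = [d * e + (1.0 - d) * w for e, w in zip(self.ema, weights)]
        # Every swa_every steps, fold the current weights into the SWA mean.
        if self.step % self.swa_every == 0:
            self.swa_count += 1
            n = self.swa_count
            self.swa = [s + (w - s) / n for s, w in zip(self.swa, weights)]
```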
Quantization
late QAT
bits: 6
scope: matrices and embeddings
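Quantization-aware training typically rounds weights to the target grid in the forward pass while still training in full precision. The sketch below shows 6-bit fake quantization of one weight row; the symmetric range and the per-row scale are assumptions, since the card gives only the bit width and scope, and the point in training at which "late" QAT switches on is not stated.

```python
def fake_quantize(row, bits=6):
    """Round a row of weights to a symmetric signed grid and map back to floats."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)                  # an all-zero row needs no scaling
    scale = max_abs / qmax                # per-row scale (assumed scheme)
    quantized = []
    for w in row:
        level = round(w / scale)          # nearest integer level
        level = max(-qmax, min(qmax, level))
        quantized.append(level * scale)   # dequantize back to float
    return quantized
```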
LR Schedule
cosine decay
parameters: null
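No schedule parameters are listed, so the following is only the textbook form of cosine decay, lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)). The function name, the absence of warmup, and the default floor of zero are assumptions.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` under plain cosine decay from lr_max to lr_min."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```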
Regularization
weight decay
parameters: null

Novel Contributions

  • GatedDeltaNet FLA K_KVShare_Wider architecture with linear attention
  • Brotli-11 artifact compression to keep all artifacts under 16 MB
  • Score-first test-time training protocol with SGD and frozen initial blocks
  • 3-seed mean validation score of 1.00980 bpb