PR #1712

open

Record: GatedDeltaNet FLA + Brotli (No TTT) — val_bpb 1.01902 (3-seed mean)

by aamodbhatt
val_bpb: 1.0190
Architecture: Hybrid
Optimizer: Muon
Artifact Size: ~15.6 MB

Training Techniques

Architecture
GatedDeltaNet
Replaces softmax attention with GatedDeltaNet linear attention from Flash Linear Attention (FLA).
parameters: {"layers":10,"dimensions":544,"heads":8}
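For readers unfamiliar with the layer, the sketch below shows one token of the gated delta rule as it is usually written for Gated DeltaNet: the state is decayed by a gate alpha, the value currently bound to key k is erased with strength beta, and the new value v is written in its place. This is a naive per-token reference in plain Python, not the chunked kernel that FLA provides and not code from this PR; the function name and argument layout are hypothetical, and keys are assumed to be unit-normalized.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One token of the gated delta rule (illustrative reference only).

    S     : state matrix, shape (d_v, d_k), as a list of rows
    q, k  : query and key vectors of length d_k (k assumed unit-norm)
    v     : value vector of length d_v
    alpha : scalar decay gate in [0, 1]
    beta  : scalar write strength in [0, 1]
    """
    d_v, d_k = len(S), len(S[0])
    # Value the current state associates with key k: S @ k
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: o = S_new @ q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out
```

With alpha = 1 and beta = 1, writing a second value under the same key replaces the first one instead of adding to it, which is the behavior that distinguishes the delta rule from plain additive linear attention.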
BigramHash
Uses bigram hash features with trigram augmentation.
parameters: {"dimensions":"3072 x 112"}
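A minimal sketch of a hashed bigram lookup follows, reading "3072 x 112" as 3072 hash buckets with 112-dimensional embeddings (an assumption; the card does not label the two numbers). The hash function, the padding id, and the random initialization are hypothetical stand-ins, and the trigram augmentation is not shown because the card does not describe it.

```python
import random

NUM_BUCKETS = 3072  # assumed meaning of the first number in "3072 x 112"
EMBED_DIM = 112     # assumed meaning of the second number


def bigram_bucket(prev_token, cur_token, num_buckets=NUM_BUCKETS):
    # Hypothetical hash: the PR's actual hash function is not stated.
    return (prev_token * 1_000_003 + cur_token) % num_buckets


def make_table(seed=0):
    # Stand-in for a learned embedding table (random values for illustration).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
            for _ in range(NUM_BUCKETS)]


def bigram_features(tokens, table):
    """Return one hashed-bigram embedding row per position."""
    feats = []
    prev = 0  # hypothetical padding id used before the first token
    for tok in tokens:
        feats.append(table[bigram_bucket(prev, tok)])
        prev = tok
    return feats
```

Because the table has far fewer rows than there are possible token pairs, distinct bigrams can collide in the same bucket; that is the usual trade-off accepted to keep the parameter count small.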
Quantization
late QAT
bits: 6
scope: matrices and embeddings
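As an illustration of what 6-bit quantization of a weight row can look like, here is a sketch of symmetric per-row quantization with a quantize/dequantize round trip ("fake quantization"), which is the operation quantization-aware training typically applies in the forward pass. The per-row scale, the symmetric range of -31 to 31, and the function names are assumptions; the card does not state the PR's exact scheme or when the "late" phase begins.

```python
def quantize_row(row, bits=6):
    """Map floats to signed integer codes using one scale per row."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    return [c * scale for c in codes]


def fake_quant(row, bits=6):
    """Round trip used during QAT: weights stay float but carry quantization error."""
    codes, scale = quantize_row(row, bits)
    return dequantize_row(codes, scale)
```

The round-trip error of any in-range weight is at most half a quantization step, so training against `fake_quant` outputs lets the model adapt to the same error it will see after the real 6-bit export.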
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
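The two listed parameters can be read as follows: an exponential moving average updated every step with decay 0.997, and a stochastic weight average that adds a snapshot every 50 steps. The sketch below tracks both over a flat list of parameters. How the PR combines the two averages (for example, whether SWA snapshots are taken from the raw or the EMA weights) is not stated, so the class below is a hypothetical illustration that simply keeps them side by side.

```python
EMA_DECAY = 0.997  # from the listed ema_decay
SWA_EVERY = 50     # from the listed swa_every


class WeightAverager:
    """Tracks an EMA and an SWA of a flat parameter list (illustrative)."""

    def __init__(self, params):
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0

    def update(self, params, step):
        # EMA: blend the previous average with the current weights every step.
        self.ema = [EMA_DECAY * e + (1.0 - EMA_DECAY) * p
                    for e, p in zip(self.ema, params)]
        # SWA: accumulate an equal-weight snapshot every SWA_EVERY steps.
        if step % SWA_EVERY == 0:
            self.swa_sum = [s + p for s, p in zip(self.swa_sum, params)]
            self.swa_count += 1

    def swa(self):
        if self.swa_count == 0:
            raise ValueError("no SWA snapshots collected yet")
        return [s / self.swa_count for s in self.swa_sum]
```

With a decay of 0.997 the EMA has an effective horizon of roughly 1 / (1 - 0.997), or about 333 steps.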
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scalars_and_embeds_optimizer":"Adam"}
Compression
Brotli
level: 11
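A sketch of the compression step, assuming the third-party `brotli` Python package and a serialized artifact already held in memory as bytes. The helper names are hypothetical, and the size limit is taken to be 16,000,000 bytes, which is an assumption since the card only says "under 16 MB".

```python
def compress_artifact(raw: bytes, quality: int = 11) -> bytes:
    """Compress serialized weights with Brotli at the given quality level."""
    import brotli  # third-party package, imported lazily so the helper below works without it
    return brotli.compress(raw, quality=quality)


def decompress_artifact(blob: bytes) -> bytes:
    import brotli
    return brotli.decompress(blob)


def fits_budget(blob: bytes, limit_bytes: int = 16_000_000) -> bool:
    # Assumed limit: decimal megabytes; adjust if the rules define MB differently.
    return len(blob) < limit_bytes
```

Quality 11 is Brotli's highest setting: it is the slowest to compress but decompression cost stays low, which suits an artifact that is written once and read at evaluation time.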
Other
other
Pure fixed predictor with no test-time training or eval-time adaptation.
parameters: null

Novel Contributions

  • GatedDeltaNet (FLA) architecture replacing softmax attention
  • Brotli level-11 artifact compression to bring the artifact size under 16 MB
  • Pure fixed predictor with no TTT
  • 3-seed mean submission with reported standard deviation
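For reference, a multi-seed summary of this kind can be computed as below. The function name is hypothetical, and the sketch uses the sample standard deviation (n - 1 denominator); the card does not say whether the PR reports the sample or the population value.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed val_bpb scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling `summarize([1.0, 2.0, 3.0])` on these placeholder inputs (not the PR's per-seed results, which the card does not list) returns a mean of 2.0 and a standard deviation of 1.0.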