PR #1712

open

Record: GatedDeltaNet FLA + Brotli (No TTT) — val_bpb 1.01902 (3-seed mean)

val_bpb

1.0190

Architecture

Hybrid

Optimizer

Muon

Artifact Size

~15.6 MB

Training Techniques

Architecture

GatedDeltaNet

Replaces softmax attention with GatedDeltaNet linear attention, using the Flash Linear Attention (FLA) implementation.

parameters: {"layers":10,"dimensions":544,"heads":8}
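
For readers unfamiliar with the layer, the following is a minimal single-head sketch of the gated delta rule as it is usually written for GatedDeltaNet: the state is decayed by a gate, then corrected toward the new value for the current key. It is a plain-Python illustration, not the FLA kernel used in this PR; the function name, the scalar gates `alpha` and `beta`, and the tiny dimensions are assumptions made for the example.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step for a single head (illustrative, not the FLA kernel).

    S     : state matrix, a list of d_v rows each of length d_k
    alpha : scalar decay gate in (0, 1]
    beta  : scalar write strength in [0, 1]
    Returns (new_state, output).
    """
    d_v, d_k = len(S), len(k)
    # 1) Decay the old memory.
    S = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # 2) What the decayed memory currently returns for key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # 3) Delta rule: move the value stored under k toward v.
    S = [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
         for i in range(d_v)]
    # 4) Read out with the query.
    out = [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S, out


# Write a value under a unit-norm key, then overwrite it.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state, first = gated_delta_step(state, q=key, k=key, v=[3.0, -1.0], alpha=1.0, beta=1.0)
state, second = gated_delta_step(state, q=key, k=key, v=[5.0, 2.0], alpha=1.0, beta=1.0)
```

With `alpha = 1` and `beta = 1` the second write fully replaces the first, which is the behavior that distinguishes the delta rule from purely additive linear attention.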

BigramHash

Uses bigram hash features with trigram augmentation.

parameters: {"dimensions":"3072 x 112"}
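
A rough sketch of how hashed n-gram features are typically indexed, assuming "3072 x 112" means 3072 hash buckets with 112-dimensional embeddings. The hash functions, the shared table for bigrams and trigrams, and all names below are hypothetical; the card does not show the PR's actual hashing scheme.

```python
NUM_BUCKETS = 3072  # assumed: rows of the hash table, from the card
EMBED_DIM = 112     # assumed: embedding width per bucket, from the card


def bigram_bucket(prev_tok, tok):
    # Hypothetical multiplicative hash; the PR's real hash is not shown.
    return (prev_tok * 1000003 + tok) % NUM_BUCKETS


def trigram_bucket(prev2_tok, prev_tok, tok):
    # Hypothetical: trigrams hash into the same table as bigrams.
    return ((prev2_tok * 1000003 + prev_tok) * 1000003 + tok) % NUM_BUCKETS


def hashed_ngram_ids(tokens):
    """Per position, return (bigram_bucket, trigram_bucket).

    An entry is None where there is not enough left context.
    """
    ids = []
    for i, tok in enumerate(tokens):
        bi = bigram_bucket(tokens[i - 1], tok) if i >= 1 else None
        tri = trigram_bucket(tokens[i - 2], tokens[i - 1], tok) if i >= 2 else None
        ids.append((bi, tri))
    return ids


ids = hashed_ngram_ids([5, 17, 5, 17])
```

Each bucket id would then select one row of the embedding table, and the selected rows are added to (or concatenated with) the token embedding before the first layer.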

Quantization

late QAT

bits: 6

scope: matrices and embeddings
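
The card gives only the bit width and scope, so the following is a sketch of one common form of 6-bit fake quantization: symmetric, with one scale per row. The per-row granularity, the symmetric range, and the hypothetical `qat_start_step` switch that models "late" QAT are all assumptions.

```python
def fake_quantize_row(row, bits=6):
    """Symmetric per-row fake quantization.

    Returns (dequantized_row, integer_codes, scale).
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(x) for x in row)
    if max_abs == 0.0:
        return list(row), [0] * len(row), 1.0
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return [c * scale for c in codes], codes, scale


def weights_for_forward(row, step, qat_start_step):
    # "Late" QAT: use full-precision weights until qat_start_step
    # (a hypothetical setting), then train against the quantized values.
    if step < qat_start_step:
        return list(row)
    return fake_quantize_row(row)[0]


row = [0.5, -0.25, 0.01, 0.0]
dequantized, codes, scale = fake_quantize_row(row)
```

In a real training loop the rounding step is non-differentiable, so gradients are usually passed through it unchanged (a straight-through estimator).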

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
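
To make the two averaging parameters concrete, here is a small sketch that tracks an exponential moving average with decay 0.997 on every step and a uniform running mean of snapshots taken every 50 steps. How the PR combines the two averages for the final artifact is not stated on the card; the class and attribute names are hypothetical.

```python
class WeightAverager:
    """Tracks an EMA (every step) and an SWA running mean (every swa_every steps)."""

    def __init__(self, weights, ema_decay=0.997, swa_every=50):
        self.ema_decay = ema_decay
        self.swa_every = swa_every
        self.ema = list(weights)
        self.swa = [0.0] * len(weights)
        self.swa_count = 0

    def update(self, weights, step):
        d = self.ema_decay
        # EMA: ema <- d * ema + (1 - d) * w
        self.ema = [d * e + (1.0 - d) * w for e, w in zip(self.ema, weights)]
        # SWA: uniform running mean over periodic snapshots.
        if step % self.swa_every == 0:
            self.swa_count += 1
            n = self.swa_count
            self.swa = [s + (w - s) / n for s, w in zip(self.swa, weights)]


# Toy run: a single "weight" that equals the step number.
averager = WeightAverager([0.0])
for step in range(1, 201):
    averager.update([float(step)], step)
```

In this toy run the SWA mean covers the snapshots at steps 50, 100, 150, and 200, while the EMA lags behind the latest value.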

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"scalars_and_embeds_optimizer":"Adam"}
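
The split described above (Muon for weight matrices, Adam for scalars and embeddings) can be sketched as a routing function over parameter names and shapes. The name-matching rule and the example parameter names and vocabulary size are hypothetical; only the 544, 3072, and 112 dimensions come from the card.

```python
def route_parameters(param_shapes, embed_tags=("embed",)):
    """Split parameter names into (muon_names, adam_names).

    2-D non-embedding matrices go to Muon; everything else
    (scalars, vectors, embedding tables) goes to Adam.
    """
    muon, adam = [], []
    for name, shape in param_shapes.items():
        is_embed = any(tag in name for tag in embed_tags)
        if len(shape) == 2 and not is_embed:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam


# Hypothetical parameter inventory for illustration only.
shapes = {
    "blocks.0.proj.weight": (544, 544),
    "blocks.0.norm.scale": (544,),
    "tok_embed.weight": (1024, 544),
    "bigram_embed.weight": (3072, 112),
}
muon_names, adam_names = route_parameters(shapes)
```

The two name lists would then be turned into separate optimizer instances, each with its own learning rate.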

Compression

Brotli

level: 11
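
A minimal sketch of the compression step, assuming the artifact is a single byte blob and that the `brotli` Python package is used with `quality=11` (its maximum). Because `brotli` is a third-party package, the sketch falls back to `zlib` when it is not installed; that fallback is only there to keep the example runnable and is not part of the PR.

```python
import importlib.util
import zlib


def compress_artifact(blob):
    """Return (codec_name, compressed_bytes)."""
    if importlib.util.find_spec("brotli") is not None:
        import brotli
        return "brotli", brotli.compress(blob, quality=11)
    # Fallback for environments without brotli (not what the PR uses).
    return "zlib", zlib.compress(blob, 9)


def decompress_artifact(codec, payload):
    if codec == "brotli":
        import brotli
        return brotli.decompress(payload)
    return zlib.decompress(payload)


blob = bytes(range(64)) * 256  # stand-in for packed quantized weights
codec, payload = compress_artifact(blob)
restored = decompress_artifact(codec, payload)
```

The compressed size of the real artifact is what counts toward the ~15.6 MB figure reported above.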

Other

No TTT

Pure fixed predictor with no test-time training or eval-time adaptation.

parameters: null
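
Since the model is scored as a fixed predictor, the reported val_bpb depends only on the frozen weights and the validation text. The sketch below assumes the usual definition of bits per byte: total negative log-likelihood in nats, converted to bits, divided by the number of UTF-8 bytes scored. The function name and toy inputs are illustrative.

```python
import math


def bits_per_byte(token_nll_nats, total_bytes):
    """Convert summed per-token NLL (in nats) to bits per byte of text."""
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / total_bytes


# Toy check: 8 tokens, each costing exactly 1 bit.
one_bit = math.log(2)
bpb_one_byte_tokens = bits_per_byte([one_bit] * 8, total_bytes=8)
bpb_two_byte_tokens = bits_per_byte([one_bit] * 8, total_bytes=16)
```

Normalizing by bytes rather than tokens keeps scores comparable across submissions that use different tokenizers.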