PR #1712 (open)
Record: GatedDeltaNet FLA + Brotli (No TTT) — val_bpb 1.01902 (3-seed mean)
by aamodbhatt
val_bpb: 1.0190
Architecture: Hybrid
Optimizer: Muon
Artifact Size: ~15.6 MB
Training Techniques
Architecture
GatedDeltaNet
Replaces softmax attention with GatedDeltaNet linear attention from Flash Linear Attention (FLA).
parameters: {"layers":10,"dimensions":544,"heads":8}
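For orientation, the gated delta rule behind GatedDeltaNet can be written as a plain recurrence over a per-head state matrix. The NumPy sketch below is a naive single-head reference, not the fused FLA kernel used in this PR; the function name, the tensor shapes, and the scalar per-step gates alpha and beta are illustrative assumptions.

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive single-head gated delta rule (illustrative reference only).

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in [0, 1].
    Returns outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))   # state matrix mapping keys to values
    out = np.zeros((T, d_v))
    for t in range(T):
        # Erase what the state currently stores under key k_t, then decay it.
        S = alpha[t] * (S - beta[t] * np.outer(S @ k[t], k[t]))
        # Write the new key/value association.
        S = S + beta[t] * np.outer(v[t], k[t])
        # Read out with the query.
        out[t] = S @ q[t]
    return out

# Two steps that write different values under the same unit-norm key.
k = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
v = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
ones = np.ones(2)
out = gated_delta_rule(k, k, v, alpha=ones, beta=ones)
```

With alpha = beta = 1 and a unit-norm key, the second write fully replaces the first, so each output row equals the value written at that step. The state has a fixed size regardless of sequence length, which is what distinguishes this family from softmax attention.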
BigramHash
Uses bigram hash features with trigram augmentation.
parameters: {"dimensions":"3072 x 112"}
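A minimal sketch of a hashed bigram embedding lookup, assuming the "3072 x 112" parameter means 3072 hash buckets with 112-dimensional embeddings. The hash function, the padding token, and the names `bigram_bucket` and `bigram_features` are hypothetical; the trigram augmentation is left out because the summary does not describe how it works.

```python
import numpy as np

NUM_BUCKETS, EMBED_DIM = 3072, 112   # table shape taken from the parameters above

def bigram_bucket(prev_token: int, token: int) -> int:
    # Hypothetical mixing hash; the hash actually used in the PR is not given here.
    return (prev_token * 1_000_003 + token) % NUM_BUCKETS

def bigram_features(tokens, table):
    """Look up one hashed bigram embedding per position."""
    prev = [0] + list(tokens[:-1])   # assumption: pad the first position with token 0
    idx = np.array([bigram_bucket(p, t) for p, t in zip(prev, tokens)])
    return table[idx]                # shape (len(tokens), EMBED_DIM)

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(NUM_BUCKETS, EMBED_DIM))
feats = bigram_features([5, 17, 5, 17], table)
```

Positions 1 and 3 share the bigram (5, 17), so they receive the same feature vector. Distinct bigrams can collide in the same bucket, which is the usual trade-off of hashing into a small table.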
Quantization
late QAT
bits: 6
scope: matrices and embeddings
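As a rough illustration of 6-bit quantization, the sketch below rounds each row of a weight matrix to a symmetric signed grid and maps it back to floats. The per-row scaling, the symmetric range, and the name `fake_quantize` are assumptions; the summary only states the bit width and scope.

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Round each row of w to a symmetric signed grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for 6 bits
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)     # integer codes
    return q * scale, q.astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16))
w_deq, codes, scale = fake_quantize(w)
```

In quantization-aware training, the forward pass would typically use `w_deq` while gradients flow to `w` through a straight-through estimator. "Late" presumably means this is switched on only near the end of training, but the exact schedule is not stated here.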
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
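The two averaging schemes can be sketched as follows, using the listed `ema_decay` of 0.997 and `swa_every` of 50. How the PR combines the EMA and SWA weights is not stated, so this hypothetical `WeightAverager` class simply tracks both independently on a flat weight array.

```python
import numpy as np

class WeightAverager:
    """Tracks an exponential moving average and a periodic (SWA) average."""

    def __init__(self, weights, ema_decay=0.997, swa_every=50):
        self.ema_decay = ema_decay
        self.swa_every = swa_every
        self.ema = weights.copy()
        self.swa_sum = np.zeros_like(weights)
        self.swa_count = 0
        self.step = 0

    def update(self, weights):
        self.step += 1
        # EMA: blend the current weights into the running average every step.
        self.ema = self.ema_decay * self.ema + (1 - self.ema_decay) * weights
        # SWA: add an equally weighted snapshot every `swa_every` steps.
        if self.step % self.swa_every == 0:
            self.swa_sum += weights
            self.swa_count += 1

    def swa(self):
        return self.swa_sum / max(self.swa_count, 1)

avg = WeightAverager(np.zeros(3))
for step in range(1, 101):
    avg.update(np.full(3, float(step)))   # stand-in for weights after each optimizer step
```

After 100 steps the SWA average holds two snapshots (steps 50 and 100), while the EMA weights the most recent steps most heavily.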
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scalars_and_embeds_optimizer":"Adam"}
Compression
Brotli
level: 11
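A minimal sketch of compressing a serialized artifact at Brotli quality 11 and checking it against the size limit. It assumes the third-party `brotli` Python package is installed and that "16 MB" means 16,000,000 bytes; the helper names are hypothetical and the PR's actual packaging code is not shown in this summary.

```python
try:
    import brotli  # third-party "Brotli" package; assumed to be installed
except ImportError:
    brotli = None

LIMIT_BYTES = 16_000_000  # assumption: the 16 MB limit is in decimal megabytes

def compress_artifact(raw: bytes) -> bytes:
    """Compress serialized weights at the maximum Brotli quality (11)."""
    if brotli is None:
        raise RuntimeError("install the 'Brotli' package to run this sketch")
    return brotli.compress(raw, quality=11)

def fits_limit(n_bytes: int, limit: int = LIMIT_BYTES) -> bool:
    """Return True when an artifact of n_bytes is strictly under the limit."""
    return n_bytes < limit
```

For example, `fits_limit(len(compress_artifact(raw)))` would gate a submission. Quality 11 is the slowest setting, which is usually acceptable here because the artifact is compressed only once.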
Other
other
Pure fixed predictor with no test-time training or eval-time adaptation.
parameters: null
Novel Contributions
- GatedDeltaNet (FLA) architecture replacing softmax attention
- Brotli level-11 compression to keep the artifact under 16 MB
- Pure fixed predictor with no TTT
- 3-seed mean submission with reported standard deviation