PR #1698

open

Record: GatedDeltaNet (FLA) + Legal Score-First TTT — val_bpb 1.00995 (3-seed mean)

by arsenis-cmd
val_bpb: 1.0099
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.83 MiB

Training Techniques

Architecture
GatedDeltaNet
Replaces softmax attention with the linear gated delta rule recurrence from Flash Linear Attention (FLA).
parameters: {"layers":10,"dimensions":544,"heads":8,"kv_sharing_stride":2}
K_KVShare_Wider
Wider architecture with KV sharing and 10-layer configuration built on PR #1687.
parameters: {"layers":10,"dimensions":544,"heads":8,"kv_sharing_stride":2}
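To make the GatedDeltaNet entry above concrete, here is a minimal single-head sketch of the gated delta rule recurrence in its textbook form. It is a naive per-token loop, not the chunked kernel that Flash Linear Attention ships and not code from this PR. The function name `gated_delta_rule` and the argument names `alpha` (decay gate) and `beta` (write strength) are assumptions made for illustration.

```python
def gated_delta_rule(q, k, v, alpha, beta):
    """Naive reference for one head (illustrative, not the FLA kernel).

    q, k: lists of T vectors of length d_k; v: list of T vectors of length d_v.
    alpha, beta: lists of T scalars in [0, 1].
    The state S has shape (d_v, d_k). Per step:
        S   <- alpha_t * (S - beta_t * (S k_t) k_t^T) + beta_t * v_t k_t^T
        o_t  = S q_t
    """
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q_t, k_t, v_t, a_t, b_t in zip(q, k, v, alpha, beta):
        # Value the current state already associates with key k_t (S k_t).
        pred = [sum(S[i][j] * k_t[j] for j in range(d_k)) for i in range(d_v)]
        for i in range(d_v):
            for j in range(d_k):
                # Delta rule: erase the old association, decay, write the new one.
                erased = S[i][j] - b_t * pred[i] * k_t[j]
                S[i][j] = a_t * erased + b_t * v_t[i] * k_t[j]
        # Read out with the query.
        outputs.append([sum(S[i][j] * q_t[j] for j in range(d_k))
                        for i in range(d_v)])
    return outputs
```

The state is a fixed-size matrix, so cost grows linearly with sequence length instead of quadratically as in softmax attention. With `alpha = 1` and `beta = 1`, writing a second value under the same unit-norm key fully overwrites the first.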
Quantization
mixed int6/int8
bits: 6
scope: matrices and embeddings
late QAT
bits: null
scope: all
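As a rough illustration of what the int6 and int8 entries above imply, the sketch below round-trips one weight row through symmetric integer quantization. The per-row scale, the symmetric range, and the helper names `quantize_row` and `dequantize_row` are assumptions. The listing does not say which granularity the PR uses or which tensors get int6 versus int8.

```python
def quantize_row(row, bits=6):
    """Symmetric quantization of one row to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]
codes6, scale6 = quantize_row(row, bits=6)
restored = dequantize_row(codes6, scale6)
```

Late QAT would typically apply a round trip like this inside the forward pass near the end of training, so the weights settle close to representable values before export.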
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_for_scalars_embeds":true}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
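The two averaging schemes listed above can be sketched on a flat list of parameters as follows. The decay of 0.997 and the 50-step snapshot interval come from the listing. How the PR combines the EMA and SWA weights is not stated, so this sketch keeps them as two independent averages, and the names `ema_update` and `SWA` are illustrative.

```python
def ema_update(ema, params, decay=0.997):
    """In-place exponential moving average: ema <- decay*ema + (1-decay)*params."""
    for i, p in enumerate(params):
        ema[i] = decay * ema[i] + (1.0 - decay) * p


class SWA:
    """Running uniform mean over parameter snapshots."""

    def __init__(self, n):
        self.mean = [0.0] * n
        self.count = 0

    def add(self, params):
        self.count += 1
        for i, p in enumerate(params):
            self.mean[i] += (p - self.mean[i]) / self.count


# Toy training loop: `params` stands in for the model weights.
params = [0.0, 0.0]
ema = list(params)
swa = SWA(len(params))
for step in range(1, 201):
    params = [p + 0.01 for p in params]   # placeholder for an optimizer step
    ema_update(ema, params, decay=0.997)  # updated every step
    if step % 50 == 0:                    # swa_every = 50
        swa.add(params)
```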
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2,"momentum":0.9}
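The ordering that makes score-first TTT legal can be sketched as a loop: each chunk is scored with the current weights before any gradient step touches it, and only then is the model adapted on that chunk. The callables `loss_fn` and `grad_fn`, the `frozen` index set (a stand-in for `freeze_blocks`), and the choice to carry momentum across chunks are all assumptions of this sketch, not details confirmed by the PR.

```python
def score_first_ttt(chunks, params, loss_fn, grad_fn, lr=0.005, epochs=3,
                    momentum=0.9, frozen=()):
    """Score each chunk first, then adapt on it with SGD plus momentum."""
    velocity = [0.0] * len(params)
    scores = []
    for chunk in chunks:
        # 1) Score: the reported loss never benefits from training on this chunk.
        scores.append(loss_fn(params, chunk))
        # 2) Adapt: a few SGD epochs on the chunk that was just scored.
        for _ in range(epochs):
            grads = grad_fn(params, chunk)
            for i, g in enumerate(grads):
                if i in frozen:
                    continue  # frozen parameters are never updated
                velocity[i] = momentum * velocity[i] + g
                params[i] -= lr * velocity[i]
    return scores


# Toy model: predict a constant w; the loss is the mean squared error.
def mse(params, chunk):
    return sum((params[0] - x) ** 2 for x in chunk) / len(chunk)


def mse_grad(params, chunk):
    return [sum(2.0 * (params[0] - x) for x in chunk) / len(chunk)]


weights = [0.0]
chunk_scores = score_first_ttt([[1.0] * 4] * 3, weights, mse, mse_grad, lr=0.1)
```

The first score is always that of the unadapted model, and later chunks benefit only from chunks that were already scored.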
Evaluation
sliding window eval
parameters: {"chunk_tokens":32768}
LR Schedule
cosine decay
parameters: {"applied_to":"TTT learning rate"}
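A cosine decay of the TTT learning rate might look like the sketch below. The base rate of 0.005 is taken from the TTT parameters. Whether the schedule is indexed by chunk or by optimizer step, and whether it decays to zero, is not stated in the listing, so `step`, `total_steps`, and `min_lr` are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.005, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at the final step."""
    progress = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate for each of 10 hypothetical schedule positions.
schedule = [cosine_lr(i, 10) for i in range(10)]
```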

Novel Contributions

  • Uses GatedDeltaNet linear attention from Flash Linear Attention instead of softmax attention.
  • Applies legal score-first test-time training: each chunk is scored before the model adapts to it with 3 epochs of SGD, so adaptation only ever uses previously scored chunks.
  • Builds on the K_KVShare_Wider architecture from PR #1687.
  • Achieves a 3-seed mean validation BPB of 1.00995 with sub-16 MiB artifacts.
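For readers unfamiliar with the metric, bits per byte is usually obtained by converting the summed negative log-likelihood from nats to bits and dividing by the number of raw bytes scored. The sketch below shows that conversion and the averaging across seeds. The function names are illustrative, and no per-seed values from this PR are reproduced here.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


def seed_mean(values):
    """Arithmetic mean of per-seed scores."""
    return sum(values) / len(values)
```

Because the denominator counts bytes and not tokens, the metric stays comparable across tokenizers with different vocabulary sizes.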