PR #1734

open

Record: GatedDeltaNet + Legal TTT + Brotli-11 — val_bpb 1.01080 (3-seed mean, VALID artifacts)

by yahya010
val_bpb: 1.0108
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15.53 MB

Training Techniques

Architecture
Gated Attention
GatedDeltaNet / GDN-based architecture with causal left-to-right attention behavior.
parameters: {"layers":10,"dimensions":544,"heads":8,"kv_share_stride":2}
Quantization
GPTQ
bits: 6
scope: model weights
Compression
brotli
level: 11
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2,"momentum":0.9}
Optimizer
Adam
weight_decay: null
momentum: null
other_params: null
Muon
weight_decay: null
momentum: null
other_params: null
Regularization
weight decay
parameters: null
Sequence Length
sequence_length
train_length: 32768
eval_length: null
Other
other
Optional macro-phase SGD TTT infrastructure was added but disabled in the scored run.
parameters: {"ttt_macro_phases":0}
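
The score-first ordering listed under Test-Time Training can be illustrated with a minimal sketch. It assumes "score-first" means each chunk is evaluated with the current weights before the model takes any gradient step on that chunk, so no token is scored by weights that have already seen it. The function name and the `score_chunk` / `adapt_on_chunk` callables are hypothetical placeholders, not code from this PR. In a real run the adaptation callable would hold the optimizer settings listed above (learning_rate 0.005, momentum 0.9, freeze_blocks 2).

```python
from typing import Callable, Sequence


def score_first_ttt(
    tokens: Sequence[int],
    score_chunk: Callable[[Sequence[int]], float],
    adapt_on_chunk: Callable[[Sequence[int]], None],
    chunk_tokens: int = 32768,
    epochs: int = 3,
) -> float:
    """Return the summed loss, scoring every chunk before adapting on it.

    Hypothetical sketch: `score_chunk` returns the loss of one chunk under
    the current weights, `adapt_on_chunk` runs one training pass on it.
    """
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score with the weights as they stand: the model has not yet
        #    been updated on any token in this chunk.
        total_loss += score_chunk(chunk)
        # 2) Only afterwards adapt on the chunk that was just scored.
        for _ in range(epochs):
            adapt_on_chunk(chunk)
    return total_loss
```

The defaults mirror the `chunk_tokens` and `epochs` values in the parameters above. Everything else about the adaptation step is left to the caller.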

Novel Contributions

  • Replaced zstandard-22 with brotli-11 to bring artifacts under the 16,000,000-byte cap without quality loss.
  • Kept clip_range=31 to avoid the quantization penalty associated with reducing clip range.
  • Added optional macro-phase SGD TTT infrastructure, but disabled it in the scored run.
  • Achieved a valid 3-seed mean val_bpb of 1.01080 with all artifacts under the size limit.
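
To make the clip_range point concrete, here is a minimal sketch of symmetric round-to-nearest quantization with an integer clip range. This is not GPTQ itself, which additionally compensates rounding error across columns. It assumes clip_range is the largest integer code magnitude, so clip_range=31 yields 63 levels (-31 to 31), which fit within the 64 values of a 6-bit code. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_row(row: List[float], clip_range: int = 31) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight row.

    Assumption: the scale maps the largest magnitude in the row onto
    clip_range, and codes are clamped to [-clip_range, clip_range].
    """
    max_abs = max((abs(w) for w in row), default=0.0)
    scale = max_abs / clip_range if max_abs > 0 else 1.0
    codes = [max(-clip_range, min(clip_range, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

Under this scheme the step size is max_abs / clip_range, so a smaller clip range means a coarser grid and a larger worst-case rounding error (half a step). That is one way to read the quantization penalty the second bullet refers to.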