PR #745

open

Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean)

by stukenov
val_bpb: 1.0222
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,857,972 bytes

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
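The record does not spell out what "GPTQ-lite" does beyond 6-bit, all-scope quantization. A minimal sketch of the int6 part, assuming symmetric per-channel round-to-nearest (full GPTQ additionally applies Hessian-weighted error compensation, omitted here):

```python
import numpy as np

def quantize_int6(w, axis=0):
    # Symmetric per-channel round-to-nearest quantization to 6 bits.
    qmax = 2 ** (6 - 1) - 1                       # int6 levels: [-31, 31]
    scale = np.max(np.abs(w), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                              # dequantized weights

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4))
w_q = quantize_int6(w)                            # error bounded by scale / 2
```

Per channel, the reconstruction error is at most half a quantization step.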
Architecture
XSA
Exclusive Self-Attention applied on all virtual layers.
parameters: {"layers":13}
Value Residual Learning
Layer 0 value output is blended into subsequent attention via learned sigmoid gates.
parameters: null
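The blend described above can be sketched as a convex combination of the current layer's values and the layer-0 values, gated by a learned sigmoid. The per-layer scalar gate here is an assumed parameterization (the record does not say whether gates are per-layer or per-head):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_values(v_layer, v0, gate_logit):
    # Blend layer-0 values into this layer's values with a learned gate.
    # gate_logit: learned scalar (hypothetical per-layer parameterization).
    g = sigmoid(gate_logit)
    return g * v_layer + (1.0 - g) * v0

v0 = np.ones((2, 3))              # value output of layer 0
v5 = np.zeros((2, 3))             # values at a later layer
mixed = mix_values(v5, v0, 0.0)   # gate_logit 0 -> gate 0.5, elementwise mean
```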
Gated Attention
Per-head sigmoid gates on attention output.
parameters: null
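A minimal sketch of per-head output gating, assuming a learned static logit per head (some variants instead compute the gate from the token's hidden state; the record does not specify):

```python
import numpy as np

def gated_heads(attn_out, gate_logits):
    # attn_out:    (heads, seq, head_dim) attention output per head
    # gate_logits: (heads,) learned logits; sigmoid gate scales each head
    g = 1.0 / (1.0 + np.exp(-gate_logits))
    return attn_out * g[:, None, None]

out = np.ones((4, 2, 3))
gated = gated_heads(out, np.zeros(4))   # zero logits -> every gate is 0.5
```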
CROWN-Q
Curvature-weighted quantization penalty during warmdown to improve int6 quantization robustness.
parameters: null
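CROWN-Q is not defined beyond the one-line description, so the following is only a sketch of the stated idea: weights with high loss curvature are penalized more for sitting far from their int6 grid point, so the final checkpoint quantizes with less damage. The penalty weight `lam` and the diagonal-curvature input are assumptions:

```python
import numpy as np

def crownq_penalty(w, curvature, lam=1e-4, bits=6):
    # Curvature-weighted pull toward the nearest int6 grid point.
    # curvature: per-weight estimate (e.g. diagonal Hessian or squared grads).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    w_q = np.round(w / scale) * scale              # nearest quantization level
    return lam * np.sum(curvature * (w - w_q) ** 2)

w = np.array([31.0, -31.0, 15.0, 0.0])             # already on the int6 grid
print(crownq_penalty(w, np.ones_like(w)))          # -> 0.0
```

Per the description, this term would be added to the loss only during the warmdown phase.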
depth recurrence
Layers 4 and 5 are repeated, creating 13 virtual layers from 11 physical layers.
parameters: {"physical_layers":11,"virtual_layers":13,"repeated_layers":[4,5]}
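The 11-to-13 expansion can be expressed as a layer schedule. Running the repeated block a second time immediately after its first pass is one plausible ordering (the record does not pin down where the second pass sits):

```python
def layer_schedule(physical_layers=11, repeated=(4, 5)):
    # Expand physical layers into virtual layers by running the
    # repeated block twice; weights are shared between the two passes.
    schedule = []
    for i in range(physical_layers):
        schedule.append(i)
        if i == max(repeated):
            schedule.extend(repeated)   # second pass over layers 4 and 5
    return schedule

print(layer_schedule())  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```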
Other
other
5-expert Hedge Mixer (neural, unigram, bigram, trigram, and entropy experts) for online context mixing during TTT evaluation.
parameters: {"experts":5}
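Assuming "Hedge" refers to the standard multiplicative-weights algorithm, the mixer would maintain a weight per expert, mix their next-token distributions, and reweight each expert by its log loss once the true token is revealed. The learning rate `eta` and the two-expert toy setup are illustrative only:

```python
import numpy as np

def hedge_mix(expert_probs, weights):
    # expert_probs: (experts, vocab); returns the weighted mixture.
    w = weights / weights.sum()
    return w @ expert_probs

def hedge_update(expert_probs, token, weights, eta=0.5):
    # Multiplicative-weights step: down-weight experts with high log loss.
    losses = -np.log(expert_probs[:, token] + 1e-12)
    return weights * np.exp(-eta * losses)

# two toy experts over a 3-token vocabulary
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.4, 0.4]])
w = np.ones(2)
mixed = hedge_mix(probs, w)        # [0.5, 0.25, 0.25]
w = hedge_update(probs, 0, w)      # token 0 observed: expert 0 gains weight
```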
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.002,"momentum":0.9}
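"Score-first" means each chunk is scored for the final metric before the model updates on it, so no chunk is ever evaluated by a model that has already trained on it. A sketch using the listed hyperparameters (`loss_and_grad` is a hypothetical per-chunk callable; the chunking itself is an assumption):

```python
import numpy as np

def ttt_score_first(chunks, loss_and_grad, params, lr=0.002, momentum=0.9):
    # loss_and_grad(params, chunk) -> (loss, grads), scored with the
    # *current* params; the SGD-with-momentum update happens afterward.
    vel = [np.zeros_like(p) for p in params]
    total = 0.0
    for chunk in chunks:
        loss, grads = loss_and_grad(params, chunk)
        total += loss                           # scored before the update
        for i, g in enumerate(grads):
            vel[i] = momentum * vel[i] + g
            params[i] = params[i] - lr * vel[i]
    return total / len(chunks)

# toy check: quadratic "loss" that ignores the chunk contents
f = lambda ps, c: (float(ps[0] @ ps[0]) / 2, [ps[0]])
params = [np.array([1.0])]
avg = ttt_score_first([None, None], f, params)  # second chunk scores lower
```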
Evaluation
sliding window eval
parameters: null
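No parameters are recorded for the sliding window, so the sketch below uses toy sizes. The usual scheme: each overlapping window re-supplies context, but only its last `stride` tokens contribute to the total, so every token is scored exactly once (`logprob_fn` is a hypothetical scoring callable):

```python
import math

def sliding_window_nll(logprob_fn, tokens, window=8, stride=4):
    # Average negative log-likelihood with strided, overlapping windows.
    total, n = 0.0, len(tokens)
    for start in range(0, n, stride):
        win = tokens[max(0, start + stride - window):start + stride]
        new = min(stride, n - start)          # tokens scored in this window
        for j in range(len(win) - new, len(win)):
            total += -logprob_fn(win[:j], win[j])
    return total / n

# uniform model over 2 symbols -> average NLL is exactly log 2
nll = sliding_window_nll(lambda ctx, t: math.log(0.5), list(range(10)))
```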
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
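The EMA with decay 0.997 is the standard exponential moving average of the weights, updated once per step:

```python
def ema_update(ema, params, decay=0.997):
    # One EMA step per training iteration: ema <- decay*ema + (1-decay)*theta.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])   # approaches 1.0 as (1 - 0.997**steps)
```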
SWA
parameters: null
Compression
lzma
level: null
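The compression level was not recorded; with Python's standard `lzma` module the artifact bytes could be packed as follows (the preset here is an assumption):

```python
import lzma

blob = b"\x00" * 4096                       # stand-in for serialized weights
packed = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
assert lzma.decompress(packed) == blob      # lossless round trip
```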
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
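Warmdown is typically a constant learning rate followed by a linear decay over the final steps; with `warmdown_steps` = 3500 that looks like (the decay-to-zero endpoint is an assumption):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # Constant LR, then linear decay to 0 over the last warmdown_steps.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

print(warmdown_lr(0, 10000, 0.02), warmdown_lr(10000, 10000, 0.02))
```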
Regularization
weight decay
parameters: {"value":0.04}

Novel Contributions

  • XSA on all layers
  • Value Residual Learning
  • Gated Attention
  • CROWN-Q curvature-weighted quantization penalty
  • Depth recurrence with repeated layers 4 and 5
  • 5-expert Hedge Mixer for online context mixing during TTT
  • Score-first test-time training with n-gram tables built only from already-scored tokens