PR #733

closed

Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0278, 3-seed mean)

by stukenov
val_bpb: 1.0278
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.8 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to all layers instead of only the last few layers.
parameters: {"layers":11}
Value Residual Learning
Blends layer 0 value outputs into subsequent attention via learned sigmoid gates.
parameters: null
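A minimal sketch of the value-residual blend described above, assuming a single learned scalar gate per layer (the record does not specify the gate's granularity; names here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_values(v_layer, v0, gate_logit):
    """Blend the current layer's value tensor with layer 0's values
    via a learned sigmoid gate (gate_logit is a trainable scalar)."""
    lam = sigmoid(gate_logit)
    return lam * v_layer + (1.0 - lam) * v0

v0 = np.ones((4, 8))    # layer-0 values (seq_len=4, head_dim=8), toy data
v3 = np.zeros((4, 8))   # current layer's values, toy data
mixed = mix_values(v3, v0, gate_logit=0.0)   # gate = 0.5 -> even blend
```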
Gated Attention
Per-head sigmoid gates on attention outputs.
parameters: null
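The per-head output gating can be sketched as follows; the gate parameterization (one logit per head, applied after attention) is an assumption, since the record gives no parameters:

```python
import numpy as np

def gate_heads(attn_out, head_gates):
    """attn_out: (n_heads, seq, head_dim); head_gates: (n_heads,) logits.
    Each head's output is scaled by its own sigmoid gate in (0, 1)."""
    g = 1.0 / (1.0 + np.exp(-head_gates))
    return attn_out * g[:, None, None]

# Toy example: head 0 half-open gate, head 1 fully open.
out = gate_heads(np.ones((2, 3, 4)), np.array([0.0, 100.0]))
```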
Depth Recurrence
Layers 4 and 5 are repeated to create 13 virtual layers from 11 physical layers.
parameters: {"physical_layers":11,"virtual_layers":13,"repeated_layers":[4,5]}
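The 11-physical / 13-virtual layer schedule follows directly from the stated parameters: repeated layers run twice with shared weights, adding depth but no parameters. A sketch:

```python
def unroll_layers(n_physical, repeated):
    """Expand physical layer indices into a virtual execution schedule:
    each layer in `repeated` is run twice (weight sharing, no new params)."""
    schedule = []
    for i in range(n_physical):
        schedule.append(i)
        if i in repeated:
            schedule.append(i)
    return schedule

# 11 physical layers with layers 4 and 5 repeated -> 13 virtual layers.
schedule = unroll_layers(11, {4, 5})
```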
Hedge Mixer
GPU-vectorized online context mixing with neural, unigram, bigram, trigram, and entropy experts.
parameters: {"experts":5}
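The classic Hedge (multiplicative-weights) update over expert next-token distributions can be sketched as below; the learning rate and the exact expert set wiring are assumptions, and the GPU vectorization is omitted for clarity:

```python
import math

class Hedge:
    """Online Hedge mixer over expert next-token distributions;
    each expert's weight is updated from its log loss on the target."""
    def __init__(self, n_experts, eta=1.0):
        self.w = [1.0] * n_experts
        self.eta = eta

    def mix(self, expert_probs):
        # Weighted average of the experts' distributions.
        z = sum(self.w)
        vocab = len(expert_probs[0])
        return [sum(w / z * p[i] for w, p in zip(self.w, expert_probs))
                for i in range(vocab)]

    def update(self, expert_probs, target):
        # Multiply each weight by exp(-eta * log loss) on the target token.
        for k, p in enumerate(expert_probs):
            loss = -math.log(max(p[target], 1e-12))
            self.w[k] *= math.exp(-self.eta * loss)

mixer = Hedge(n_experts=2)
experts = [[0.9, 0.1], [0.1, 0.9]]   # two toy experts over a 2-token vocab
probs = mixer.mix(experts)           # equal weights -> even mixture
mixer.update(experts, target=0)      # expert 0 was right, gains weight
```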
Other
CROWN-Q
Curvature-weighted quantization penalty applied during warmdown to encourage flatter minima for quantization robustness.
parameters: null
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"freeze_blocks":0}
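The defining constraint of score-first TTT is the scoring discipline: every token is scored by weights that have not yet trained on it, and only afterwards does the model adapt on that chunk. A minimal sketch with a toy model (the real run uses 3 epochs of SGD with lr 0.002 and momentum 0.9, which is not reproduced here):

```python
def score_first_ttt(model, chunks, update):
    """Score each chunk with the current weights first, then adapt on it,
    so no token is ever scored by weights that have already seen it."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += model.loss(chunk)   # score with pre-update weights
        update(model, chunk)              # then train on the same chunk
    return total_loss

class ToyModel:
    """Illustrative stand-in: loss shrinks as the model adapts."""
    def __init__(self):
        self.adapted = 0
    def loss(self, chunk):
        return 1.0 / (1 + self.adapted)

def sgd_step(model, chunk):
    model.adapted += 1

m = ToyModel()
loss = score_first_ttt(m, chunks=[[1], [2], [3]], update=sgd_step)
```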
Evaluation
sliding window eval
parameters: null
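In a sliding-window evaluation, each window covers the full context length but only its trailing stride of tokens is scored (the first window is scored in full), so every token is scored exactly once with near-maximal context. A sketch of the span planning, with window and stride values chosen for illustration:

```python
def sliding_window_spans(n_tokens, window, stride):
    """Plan (context_start, context_end, n_scored) spans: only the last
    `stride` tokens of each window are scored, except the first window,
    which is scored in full."""
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - begin if begin == 0 else end - (begin + window - stride)
        spans.append((begin, end, n_scored))
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(n_tokens=10, window=4, stride=2)
```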
Weight Averaging
EMA
parameters: {"decay":0.997}
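The EMA update with the stated decay of 0.997 is the standard exponential moving average over the weights:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step over flattened weights: ema <- decay*ema + (1-decay)*p."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Toy run: EMA of a constant weight 1.0 starting from 0.0.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])
```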
SWA
parameters: null
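SWA with no stated parameters is read here as a plain mean over checkpoints collected along the tail of training (the checkpoint selection schedule is not given and is an assumption):

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: elementwise mean of checkpoint weights."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]

avg = swa_average([[1.0, 2.0], [3.0, 4.0]])
```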
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
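A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final steps; with the stated warmdown_steps of 3500 this looks like (total step count and base LR are placeholders):

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```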
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(i+1)"}
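The stated 1/sqrt(i+1) rule gives each layer i a fixed LayerNorm output scale that damps deeper layers' contribution to the residual stream:

```python
import math

def ln_scales(n_layers):
    """Per-layer LayerNorm output scale 1/sqrt(i+1) for layer index i."""
    return [1.0 / math.sqrt(i + 1) for i in range(n_layers)]

scales = ln_scales(11)   # one scale per physical layer
```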
weight decay
parameters: {"weight_decay":0.04}
Quantization
GPTQ-lite
bits: 6
scope: all
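The record does not spell out GPTQ-lite; as a simplified stand-in, per-row symmetric 6-bit round-to-nearest quantization (omitting GPTQ's error-compensation step) looks like this:

```python
import numpy as np

def quantize_6bit(w):
    """Per-row symmetric 6-bit quantization: levels in [-32, 31].
    Simplified round-to-nearest; GPTQ's Hessian-based error
    compensation is deliberately omitted."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.array([[0.5, -1.0, 0.25]])
q, s = quantize_6bit(w)
dequant = q * s   # reconstruction error bounded by scale / 2
```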
Compression
lzma
level: null
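Since the compression level is listed as null, the sketch below uses the standard-library lzma module with an explicitly chosen preset (an assumption, not the record's setting) on a stand-in byte payload:

```python
import lzma

payload = bytes(range(64)) * 256            # stand-in for packed weight bytes
packed = lzma.compress(payload, preset=9)   # preset is an assumed choice
ratio = len(packed) / len(payload)          # < 1.0 for repetitive payloads
```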

Novel Contributions

  • XSA applied to all 11 layers
  • Value Residual Learning
  • Gated Attention
  • CROWN-Q curvature-weighted quantization penalty
  • Depth recurrence with layers 4 and 5 repeated into 13 virtual layers
  • 5-expert Hedge Mixer for legal score-first TTT
  • Score-first test-time training with tokens scored before any weight update