PR #745
openRecord: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean)
by stukenov
val_bpb
1.0222
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,857,972 bytes
Training Techniques
Quantization
GPTQ-lite
bits: 6
scope: all
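The record lists 6-bit GPTQ-lite quantization over all weights but gives no further detail. Below is a minimal sketch of symmetric per-output-channel 6-bit round-to-nearest quantization as a baseline illustration; an actual GPTQ-lite pass presumably adds GPTQ-style error compensation on top of a scheme like this.

```python
import torch

def quantize_int6_rtn(w: torch.Tensor):
    """Symmetric per-output-channel 6-bit round-to-nearest quantization.

    w: (out_features, in_features). Baseline sketch only; not the
    submission's GPTQ-lite implementation.
    """
    qmax = 2 ** (6 - 1) - 1                        # 31 for signed int6
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-12)                 # avoid division by zero
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                 # int6 values stored in int8

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```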
Architecture
XSA
Exclusive Self-Attention applied to all 13 virtual layers.
parameters: {"layers":13}
Value Residual Learning
The value output of layer 0 is blended into the value streams of subsequent attention layers via learned sigmoid gates.
parameters: null
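In practice this means caching the layer-0 value tensor and mixing it into each later layer's values through a learned gate. A minimal sketch, assuming one gate per head (the gate granularity is not reported):

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """Blend the current layer's value tensor with the layer-0 value tensor
    through a learned sigmoid gate."""

    def __init__(self, num_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_heads, 1, 1))  # sigmoid(0) = 0.5

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        # v, v0: (batch, heads, seq, head_dim); v0 is cached from layer 0
        lam = torch.sigmoid(self.gate)
        return lam * v + (1.0 - lam) * v0
```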
Gated Attention
Per-head sigmoid gates on attention output.
parameters: null
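A sketch of per-head output gating, assuming the gates are predicted from the layer input (the gate's input is not reported):

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Per-head sigmoid gate applied to the attention output before the
    output projection."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, num_heads, bias=True)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); attn_out: (batch, heads, seq, head_dim)
        g = torch.sigmoid(self.gate_proj(x))        # (batch, seq, heads)
        g = g.transpose(1, 2).unsqueeze(-1)         # (batch, heads, seq, 1)
        return attn_out * g
```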
CROWN-Q
Curvature-weighted quantization penalty during warmdown to improve int6 quantization robustness.
parameters: null
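CROWN-Q is introduced by this submission and only its purpose is described. One plausible form is an auxiliary loss that penalizes the gap between each weight and its int6-quantized value, weighted by a diagonal curvature proxy such as an EMA of squared gradients; the sketch below is an illustrative guess at that structure, not the submission's code.

```python
import torch

def crownq_penalty(params, curvatures, quantize_fn, lam=1e-4):
    """Curvature-weighted quantization penalty (illustrative form only).

    params:      iterable of weight tensors
    curvatures:  per-weight curvature estimates (e.g. EMA of squared grads)
    quantize_fn: maps a weight tensor to its dequantized int6 snapshot
    """
    penalty = 0.0
    for w, c in zip(params, curvatures):
        w_q = quantize_fn(w.detach())          # no gradient through the snapshot
        penalty = penalty + (c * (w - w_q) ** 2).sum()
    return lam * penalty
```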
depth recurrence
Layers 4 and 5 are repeated, creating 13 virtual layers from 11 physical layers.
parameters: {"physical_layers":11,"virtual_layers":13,"repeated_layers":[4,5]}
Other
other
A 5-expert Hedge Mixer performs online context mixing during TTT evaluation, combining neural, unigram, bigram, trigram, and entropy experts.
parameters: {"experts":5}
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.002,"momentum":0.9}
Evaluation
sliding window eval
parameters: null
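A sketch of sliding-window evaluation in which only the newest targets of each window are counted, so every token keeps up to `window - stride` tokens of left context. The window and stride sizes are illustrative since none are reported, and `model(ids) -> logits` is an assumed interface.

```python
import torch

@torch.no_grad()
def sliding_window_eval(model, tokens, window=1024, stride=512):
    tokens = torch.as_tensor(tokens).unsqueeze(0)       # (1, T) token ids
    T = tokens.size(1)
    nll, scored = 0.0, 0                                 # scored target positions
    begin = 0
    while scored < T - 1:
        end = min(begin + window, T)
        ids = tokens[:, begin:end]
        logits = model(ids[:, :-1])                      # (1, L-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tok_nll = -logp.gather(-1, ids[:, 1:, None]).squeeze(-1)
        # targets in this window sit at positions begin+1 .. end-1; only count
        # the ones a previous window has not already scored
        new = (end - 1) - max(scored, begin)
        nll += tok_nll[:, -new:].sum().item()
        scored += new
        begin += stride
    return nll / scored                                  # mean nats per token
```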
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
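For reference, a Muon-style update with the reported decoupled weight decay of 0.04. The Newton-Schulz coefficients and scaling follow common public implementations and may differ from this submission; the lr and momentum values are placeholders, since momentum is listed as null.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration as used in public Muon implementations.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    # One update for a 2-D weight matrix: momentum buffer, orthogonalized
    # direction, decoupled weight decay (0.04 as reported).
    buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(buf)
    update *= max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.mul_(1.0 - lr * weight_decay)
    param.add_(update, alpha=-lr)
    return buf
```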
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
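A sketch of the weight-averaging pair: an EMA shadow copy with the reported decay of 0.997, plus a uniform SWA-style average of late checkpoints (no SWA parameters are reported).

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights (decay 0.997 as reported);
    evaluation uses the shadow copy rather than the raw training weights."""

    def __init__(self, model, decay=0.997):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)      # s = decay*s + (1-decay)*p

@torch.no_grad()
def swa_average(state_dicts):
    """Uniform average of late checkpoints (assumes floating-point entries)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return avg
```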
Compression
lzma
level: null
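A sketch of artifact packaging with LZMA; the preset is illustrative since the level is listed as null.

```python
import io
import lzma
import torch

def save_compressed(model, path):
    """Serialize the (quantized) state dict and compress it with LZMA."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with lzma.open(path, "wb", preset=9 | lzma.PRESET_EXTREME) as f:
        f.write(buf.getvalue())
```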
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"value":0.04}
Novel Contributions
- XSA on all layers
- Value Residual Learning
- Gated Attention
- CROWN-Q curvature-weighted quantization penalty
- Depth recurrence with repeated layers 4 and 5
- 5-expert Hedge Mixer for online context mixing during TTT
- Score-first test-time training with n-gram tables built only from already-scored tokens