PR #733

closed

Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0278, 3-seed mean)

by stukenov
val_bpb: 1.0278
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.8 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to all layers instead of only the last few layers.
parameters: {"layers":11}
Value Residual Learning
Blends layer 0 value outputs into subsequent attention via learned sigmoid gates.
parameters: null
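A minimal sketch of the value-residual blend described above, assuming a single learned scalar gate per layer (the record does not specify the gate's granularity; names here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_values(v_layer, v0, gate_logit):
    """Blend the current layer's value tensor with layer 0's values
    via a learned sigmoid gate (gate_logit is a trainable scalar)."""
    lam = sigmoid(gate_logit)
    return lam * v_layer + (1.0 - lam) * v0

v0 = np.ones((4, 8))    # layer-0 values (seq_len=4, head_dim=8), toy data
v3 = np.zeros((4, 8))   # current layer's values, toy data
mixed = mix_values(v3, v0, gate_logit=0.0)   # gate = 0.5 -> even blend
```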
Gated Attention
Per-head sigmoid gates on attention outputs.
parameters: null
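The per-head output gating can be sketched as follows; the gate parameterization (one logit per head, applied after attention) is an assumption, since the record gives no parameters:

```python
import numpy as np

def gate_heads(attn_out, head_gates):
    """attn_out: (n_heads, seq, head_dim); head_gates: (n_heads,) logits.
    Each head's output is scaled by its own sigmoid gate in (0, 1)."""
    g = 1.0 / (1.0 + np.exp(-head_gates))
    return attn_out * g[:, None, None]

# Toy example: head 0 half-open gate, head 1 fully open.
out = gate_heads(np.ones((2, 3, 4)), np.array([0.0, 100.0]))
```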
Depth Recurrence
Layers 4 and 5 are repeated to create 13 virtual layers from 11 physical layers.
parameters: {"physical_layers":11,"virtual_layers":13,"repeated_layers":[4,5]}
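The 11-physical / 13-virtual layer schedule follows directly from the stated parameters: repeated layers run twice with shared weights, adding depth but no parameters. A sketch:

```python
def unroll_layers(n_physical, repeated):
    """Expand physical layer indices into a virtual execution schedule:
    each layer in `repeated` is run twice (weight sharing, no new params)."""
    schedule = []
    for i in range(n_physical):
        schedule.append(i)
        if i in repeated:
            schedule.append(i)
    return schedule

# 11 physical layers with layers 4 and 5 repeated -> 13 virtual layers.
schedule = unroll_layers(11, {4, 5})
```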
Hedge Mixer
GPU-vectorized online context mixing with neural, unigram, bigram, trigram, and entropy experts.
parameters: {"experts":5}
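The classic Hedge (multiplicative-weights) update over expert next-token distributions can be sketched as below; the learning rate and the exact expert set wiring are assumptions, and the GPU vectorization is omitted for clarity:

```python
import math

class Hedge:
    """Online Hedge mixer over expert next-token distributions;
    each expert's weight is updated from its log loss on the target."""
    def __init__(self, n_experts, eta=1.0):
        self.w = [1.0] * n_experts
        self.eta = eta

    def mix(self, expert_probs):
        # Weighted average of the experts' distributions.
        z = sum(self.w)
        vocab = len(expert_probs[0])
        return [sum(w / z * p[i] for w, p in zip(self.w, expert_probs))
                for i in range(vocab)]

    def update(self, expert_probs, target):
        # Multiply each weight by exp(-eta * log loss) on the target token.
        for k, p in enumerate(expert_probs):
            loss = -math.log(max(p[target], 1e-12))
            self.w[k] *= math.exp(-self.eta * loss)

mixer = Hedge(n_experts=2)
experts = [[0.9, 0.1], [0.1, 0.9]]   # two toy experts over a 2-token vocab
probs = mixer.mix(experts)           # equal weights -> even mixture
mixer.update(experts, target=0)      # expert 0 was right, gains weight
```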
Other
CROWN-Q
Curvature-weighted quantization penalty applied during warmdown to encourage flatter minima for quantization robustness.
parameters: null
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"freeze_blocks":0}
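The defining constraint of score-first TTT is the scoring discipline: every token is scored by weights that have not yet trained on it, and only afterwards does the model adapt on that chunk. A minimal sketch with a toy model (the real run uses 3 epochs of SGD with lr 0.002 and momentum 0.9, which is not reproduced here):

```python
def score_first_ttt(model, chunks, update):
    """Score each chunk with the current weights first, then adapt on it,
    so no token is ever scored by weights that have already seen it."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += model.loss(chunk)   # score with pre-update weights
        update(model, chunk)              # then train on the same chunk
    return total_loss

class ToyModel:
    """Illustrative stand-in: loss shrinks as the model adapts."""
    def __init__(self):
        self.adapted = 0
    def loss(self, chunk):
        return 1.0 / (1 + self.adapted)

def sgd_step(model, chunk):
    model.adapted += 1

m = ToyModel()
loss = score_first_ttt(m, chunks=[[1], [2], [3]], update=sgd_step)
```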
Evaluation
sliding window eval
parameters: null
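In a sliding-window evaluation, each window covers the full context length but only its trailing stride of tokens is scored (the first window is scored in full), so every token is scored exactly once with near-maximal context. A sketch of the span planning, with window and stride values chosen for illustration:

```python
def sliding_window_spans(n_tokens, window, stride):
    """Plan (context_start, context_end, n_scored) spans: only the last
    `stride` tokens of each window are scored, except the first window,
    which is scored in full."""
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - begin if begin == 0 else end - (begin + window - stride)
        spans.append((begin, end, n_scored))
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(n_tokens=10, window=4, stride=2)
```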
Weight Averaging
EMA
parameters: {"decay":0.997}
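The EMA update with the stated decay of 0.997 is the standard exponential moving average over the weights:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step over flattened weights: ema <- decay*ema + (1-decay)*p."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Toy run: EMA of a constant weight 1.0 starting from 0.0.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])
```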
SWA
parameters: null
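SWA with no stated parameters is read here as a plain mean over checkpoints collected along the tail of training (the checkpoint selection schedule is not given and is an assumption):

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: elementwise mean of checkpoint weights."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]

avg = swa_average([[1.0, 2.0], [3.0, 4.0]])
```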
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
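A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final steps; with the stated warmdown_steps of 3500 this looks like (total step count and base LR are placeholders):

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```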
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(i+1)"}
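The stated 1/sqrt(i+1) rule gives each layer i a fixed LayerNorm output scale that damps deeper layers' contribution to the residual stream:

```python
import math

def ln_scales(n_layers):
    """Per-layer LayerNorm output scale 1/sqrt(i+1) for layer index i."""
    return [1.0 / math.sqrt(i + 1) for i in range(n_layers)]

scales = ln_scales(11)   # one scale per physical layer
```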
weight decay
parameters: {"weight_decay":0.04}
Quantization
GPTQ-lite
bits: 6
scope: all
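The record does not spell out GPTQ-lite; as a simplified stand-in, per-row symmetric 6-bit round-to-nearest quantization (omitting GPTQ's error-compensation step) looks like this:

```python
import numpy as np

def quantize_6bit(w):
    """Per-row symmetric 6-bit quantization: levels in [-32, 31].
    Simplified round-to-nearest; GPTQ's Hessian-based error
    compensation is deliberately omitted."""
    qmax = 2 ** (6 - 1) - 1                       # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

w = np.array([[0.5, -1.0, 0.25]])
q, s = quantize_6bit(w)
dequant = q * s   # reconstruction error bounded by scale / 2
```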
Compression
lzma
level: null
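Since the compression level is listed as null, the sketch below uses the standard-library lzma module with an explicitly chosen preset (an assumption, not the record's setting) on a stand-in byte payload:

```python
import lzma

payload = bytes(range(64)) * 256            # stand-in for packed weight bytes
packed = lzma.compress(payload, preset=9)   # preset is an assumed choice
ratio = len(packed) / len(payload)          # < 1.0 for repetitive payloads
```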

Novel Contributions

  • XSA applied to all 11 layers
  • Value Residual Learning
  • Gated Attention
  • CROWN-Q curvature-weighted quantization penalty
  • Depth recurrence with layers 4 and 5 repeated into 13 virtual layers
  • 5-expert Hedge Mixer for legal score-first TTT
  • Score-first test-time training with tokens scored before any weight update