PR #1092
openXSA-All 11L + LeakyReLU(0.75)² + Aggressive Legal TTT → 1.1219 BPB
by teddyoweh
val_bpb: 1.1219
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.92 MB
Training Techniques
Architecture
XSA
Extended self-attention applied to all 11 layers instead of only the last 4 layers.
parameters: {"layers":11}
LeakyReLU
LeakyReLU activation with negative slope 0.75, squared after activation.
parameters: {"negative_slope":0.75}
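A minimal sketch of this activation as described: LeakyReLU with negative slope 0.75, then squared. The scalar formulation below is illustrative; the actual PR presumably applies it elementwise to tensors.

```python
def leaky_relu_squared(x, negative_slope=0.75):
    """LeakyReLU(0.75) followed by squaring: (x if x >= 0 else 0.75*x)^2.
    Note the square makes the output non-negative on both branches."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```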
BigramHash
Bigram vocabulary hashing used as part of the model input representation.
parameters: {"vocab_size":2048}
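The PR does not show its hash function, so the sketch below only illustrates the general idea: map each consecutive token pair to one of 2048 buckets so bigram identity can feed the input representation. The mixing constant and the padding choice for the first position are assumptions, not the PR's actual scheme.

```python
def bigram_hash_ids(tokens, vocab_size=2048):
    """Hash each (previous, current) token pair into [0, vocab_size).
    1000003 is an arbitrary odd mixing prime; prev=0 is a hypothetical
    padding token for the first position."""
    ids = []
    prev = 0
    for tok in tokens:
        ids.append(((prev * 1000003) ^ tok) % vocab_size)
        prev = tok
    return ids
```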
VE128
Value-embedding (value expansion) module of dimension 128, enabled on layers 9 and 10.
parameters: {"dim":128,"layers":[9,10]}
RoPE
Partial rotary positional embeddings covering 16 dimensions per head.
parameters: {"dimensions":16}
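Partial RoPE rotates only a small slice of each head's dimensions and passes the rest through unchanged. The sketch below assumes the rotated slice is the first 16 components, treated as 8 complex pairs with the standard base-10000 frequencies; which slice the PR actually rotates is not stated.

```python
import math

def apply_partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` components of `vec` by position-
    dependent angles (8 complex pairs for rope_dims=16); remaining
    components pass through untouched."""
    out = list(vec)
    for i in range(rope_dims // 2):
        theta = pos * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```

Because each pair is rotated, the norm of the rotated slice is preserved, and position 0 is the identity.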
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.03,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9}
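A sketch of how "score-first" TTT can stay legal: each chunk is scored with the current weights before the model takes any gradient steps on it, so no chunk contributes to its own score. The loop below uses the listed hyperparameters (lr 0.03, 3 epochs per chunk, momentum 0.9, grad clip 1.0, no frozen blocks); `score_fn` and `grad_fn` are hypothetical stand-ins for the model's loss and gradient, and the exact update order in the PR may differ.

```python
def ttt_evaluate(chunks, score_fn, grad_fn, params,
                 lr=0.03, epochs=3, momentum=0.9, grad_clip=1.0):
    """Score-first TTT: score each chunk BEFORE adapting on it, then run
    `epochs` SGD-with-momentum passes with global-norm gradient clipping.
    All parameters are updated (freeze_blocks=0). Returns mean score."""
    velocity = [0.0] * len(params)
    total = 0.0
    for chunk in chunks:
        total += score_fn(params, chunk)      # score first: legal
        for _ in range(epochs):               # then adapt on that chunk
            grads = grad_fn(params, chunk)
            norm = sum(g * g for g in grads) ** 0.5
            if norm > grad_clip:
                grads = [g * grad_clip / norm for g in grads]
            for i, g in enumerate(grads):
                velocity[i] = momentum * velocity[i] + g
                params[i] -= lr * velocity[i]
    return total / len(chunks)
```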
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"ttt_learning_rate":0.03,"ttt_epochs":3,"grad_clip":1}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
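With both averaging schemes listed, a plausible combination is an EMA updated every step (decay 0.997) alongside an SWA snapshot accumulated every 50 steps. The per-step update can be sketched as below; how the two averages are finally merged is not stated in the PR.

```python
def update_averages(step, weights, ema, swa_sum, swa_count,
                    ema_decay=0.997, swa_every=50):
    """Update the EMA every step; add an SWA snapshot every `swa_every`
    steps. The SWA estimate is swa_sum / swa_count."""
    ema = [ema_decay * e + (1.0 - ema_decay) * w for e, w in zip(ema, weights)]
    if step % swa_every == 0:
        swa_sum = [s + w for s, w in zip(swa_sum, weights)]
        swa_count += 1
    return ema, swa_sum, swa_count
```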
Quantization
late QAT
bits: 6
scope: all
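The core of QAT is fake quantization: rounding weights to the target grid in the forward pass while training continues ("late" meaning it is enabled only near the end of training). A per-value sketch for a signed 6-bit grid, assuming a symmetric scheme with an externally chosen scale:

```python
def fake_quantize(x, scale, bits=6):
    """Fake-quantize one value for QAT: round to the nearest level of a
    signed `bits`-bit grid (-32..31 for 6 bits), clamp, and dequantize.
    The symmetric scheme and external `scale` are assumptions."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale
```

Values within range incur at most half a step (scale / 2) of error; out-of-range values clamp to the grid edges.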
Regularization
LN scale
parameters: {"enabled":true}
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
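With only `warmdown_steps` given, one plausible reading is a constant LR followed by a cosine decay to zero over the final 3500 steps. A sketch under that assumption:

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR until the warmdown window, then cosine decay from
    base_lr to 0 over the final `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```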
Evaluation
sliding window eval
parameters: {"stride":64}
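Sliding-window evaluation with stride 64 typically means each window advances by 64 tokens and only the newest 64 positions of each window are scored, so most tokens see near-full left context. The window length of 1024 below is illustrative; only the stride comes from the PR.

```python
def sliding_windows(num_tokens, window=1024, stride=64):
    """Return (window_start, window_end, score_start) triples: the model
    attends over [window_start, window_end) but only positions from
    score_start onward count toward BPB. The first window scores all of
    its positions; later windows score only the final `stride`."""
    spans = []
    first_end = min(window, num_tokens)
    for end in range(first_end, num_tokens + 1, stride):
        start = max(0, end - window)
        score_start = end - stride if end > first_end else start
        spans.append((start, end, score_start))
    return spans
```

When `num_tokens - window` is a multiple of the stride, the scored spans partition the token sequence exactly.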
Compression
lzma
level: null
Novel Contributions
- XSA applied to all 11 layers instead of only the last 4
- LeakyReLU(0.75) squared activation variant
- Aggressive legal score-first TTT with lr=0.03 and all blocks unfrozen
- Automatic Flash Attention 3 fallback to PyTorch SDPA
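The fallback in the last bullet is a dispatch-at-import pattern: probe for the Flash Attention 3 kernels and fall back to PyTorch's `scaled_dot_product_attention` when they are unavailable. A minimal sketch of the pattern (the module name follows the flash-attn v3 beta packaging; the PR's exact probe may differ):

```python
# Prefer Flash Attention 3 kernels when importable; otherwise fall back
# to PyTorch SDPA. Only the availability probe is shown here.
try:
    import flash_attn_interface  # FA3 kernel package (assumed name)
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention_backend():
    """Return which attention backend the model would dispatch to."""
    return "flash_attention_3" if HAS_FA3 else "torch_sdpa"
```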