PR #1092
openXSA-All 11L + LeakyReLU(0.75)² + Aggressive Legal TTT → 1.1219 BPB
by teddyoweh
val_bpb: 1.1219
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.92 MB
Training Techniques
Architecture
XSA
Extended self-attention applied to all 11 layers instead of only the last 4 layers.
parameters: {"layers":11}
LeakyReLU
LeakyReLU activation with negative slope 0.75, squared after activation.
parameters: {"negative_slope":0.75}
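A minimal sketch of this activation as described: LeakyReLU with negative slope 0.75, then squared. The scalar formulation below is illustrative; the actual PR presumably applies it elementwise to tensors.

```python
def leaky_relu_squared(x, negative_slope=0.75):
    """LeakyReLU(0.75) followed by squaring: (x if x >= 0 else 0.75*x)^2.
    Note the square makes the output non-negative on both branches."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```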
BigramHash
Bigram vocabulary hashing used as part of the model input representation.
parameters: {"vocab_size":2048}
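The PR does not show its hash function, so the sketch below only illustrates the general idea: map each consecutive token pair to one of 2048 buckets so bigram identity can feed the input representation. The mixing constant and the padding choice for the first position are assumptions, not the PR's actual scheme.

```python
def bigram_hash_ids(tokens, vocab_size=2048):
    """Hash each (previous, current) token pair into [0, vocab_size).
    1000003 is an arbitrary odd mixing prime; prev=0 is a hypothetical
    padding token for the first position."""
    ids = []
    prev = 0
    for tok in tokens:
        ids.append(((prev * 1000003) ^ tok) % vocab_size)
        prev = tok
    return ids
```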
VE128
Value-embedding (value expansion) module of dimension 128, enabled on layers 9 and 10.
parameters: {"dim":128,"layers":[9,10]}
RoPE
Partial rotary positional embeddings covering 16 dimensions per head.
parameters: {"dimensions":16}
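Partial RoPE rotates only a small slice of each head's dimensions and passes the rest through unchanged. The sketch below assumes the rotated slice is the first 16 components, treated as 8 complex pairs with the standard base-10000 frequencies; which slice the PR actually rotates is not stated.

```python
import math

def apply_partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` components of `vec` by position-
    dependent angles (8 complex pairs for rope_dims=16); remaining
    components pass through untouched."""
    out = list(vec)
    for i in range(rope_dims // 2):
        theta = pos * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```

Because each pair is rotated, the norm of the rotated slice is preserved, and position 0 is the identity.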
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.03,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9}
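A sketch of how "score-first" TTT can stay legal: each chunk is scored with the current weights before the model takes any gradient steps on it, so no chunk contributes to its own score. The loop below uses the listed hyperparameters (lr 0.03, 3 epochs per chunk, momentum 0.9, grad clip 1.0, no frozen blocks); `score_fn` and `grad_fn` are hypothetical stand-ins for the model's loss and gradient, and the exact update order in the PR may differ.

```python
def ttt_evaluate(chunks, score_fn, grad_fn, params,
                 lr=0.03, epochs=3, momentum=0.9, grad_clip=1.0):
    """Score-first TTT: score each chunk BEFORE adapting on it, then run
    `epochs` SGD-with-momentum passes with global-norm gradient clipping.
    All parameters are updated (freeze_blocks=0). Returns mean score."""
    velocity = [0.0] * len(params)
    total = 0.0
    for chunk in chunks:
        total += score_fn(params, chunk)      # score first: legal
        for _ in range(epochs):               # then adapt on that chunk
            grads = grad_fn(params, chunk)
            norm = sum(g * g for g in grads) ** 0.5
            if norm > grad_clip:
                grads = [g * grad_clip / norm for g in grads]
            for i, g in enumerate(grads):
                velocity[i] = momentum * velocity[i] + g
                params[i] -= lr * velocity[i]
    return total / len(chunks)
```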
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"ttt_learning_rate":0.03,"ttt_epochs":3,"grad_clip":1}
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
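With both averaging schemes listed, a plausible combination is an EMA updated every step (decay 0.997) alongside an SWA snapshot accumulated every 50 steps. The per-step update can be sketched as below; how the two averages are finally merged is not stated in the PR.

```python
def update_averages(step, weights, ema, swa_sum, swa_count,
                    ema_decay=0.997, swa_every=50):
    """Update the EMA every step; add an SWA snapshot every `swa_every`
    steps. The SWA estimate is swa_sum / swa_count."""
    ema = [ema_decay * e + (1.0 - ema_decay) * w for e, w in zip(ema, weights)]
    if step % swa_every == 0:
        swa_sum = [s + w for s, w in zip(swa_sum, weights)]
        swa_count += 1
    return ema, swa_sum, swa_count
```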
Quantization
late QAT
bits: 6
scope: all
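The core of QAT is fake quantization: rounding weights to the target grid in the forward pass while training continues ("late" meaning it is enabled only near the end of training). A per-value sketch for a signed 6-bit grid, assuming a symmetric scheme with an externally chosen scale:

```python
def fake_quantize(x, scale, bits=6):
    """Fake-quantize one value for QAT: round to the nearest level of a
    signed `bits`-bit grid (-32..31 for 6 bits), clamp, and dequantize.
    The symmetric scheme and external `scale` are assumptions."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale
```

Values within range incur at most half a step (scale / 2) of error; out-of-range values clamp to the grid edges.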
Regularization
LN scale
parameters: {"enabled":true}
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
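With only `warmdown_steps` given, one plausible reading is a constant LR followed by a cosine decay to zero over the final 3500 steps. A sketch under that assumption:

```python
import math

def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR until the warmdown window, then cosine decay from
    base_lr to 0 over the final `warmdown_steps` steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```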
Evaluation
sliding window eval
parameters: {"stride":64}
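Sliding-window evaluation with stride 64 typically means each window advances by 64 tokens and only the newest 64 positions of each window are scored, so most tokens see near-full left context. The window length of 1024 below is illustrative; only the stride comes from the PR.

```python
def sliding_windows(num_tokens, window=1024, stride=64):
    """Return (window_start, window_end, score_start) triples: the model
    attends over [window_start, window_end) but only positions from
    score_start onward count toward BPB. The first window scores all of
    its positions; later windows score only the final `stride`."""
    spans = []
    first_end = min(window, num_tokens)
    for end in range(first_end, num_tokens + 1, stride):
        start = max(0, end - window)
        score_start = end - stride if end > first_end else start
        spans.append((start, end, score_start))
    return spans
```

When `num_tokens - window` is a multiple of the stride, the scored spans partition the token sequence exactly.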
Compression
lzma
level: null
Novel Contributions
- XSA applied to all 11 layers instead of only the last 4
- LeakyReLU(0.75) squared activation variant
- Aggressive legal score-first TTT with lr=0.03 and all blocks unfrozen
- Automatic Flash Attention 3 fallback to PyTorch SDPA
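The fallback in the last bullet is a dispatch-at-import pattern: probe for the Flash Attention 3 kernels and fall back to PyTorch's `scaled_dot_product_attention` when they are unavailable. A minimal sketch of the pattern (the module name follows the flash-attn v3 beta packaging; the PR's exact probe may differ):

```python
# Prefer Flash Attention 3 kernels when importable; otherwise fall back
# to PyTorch SDPA. Only the availability probe is shown here.
try:
    import flash_attn_interface  # FA3 kernel package (assumed name)
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention_backend():
    """Return which attention backend the model would dispatch to."""
    return "flash_attention_3" if HAS_FA3 else "torch_sdpa"
```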