PR #1176

open

Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0962 (3-seed mean)

val_bpb: 1.0962
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ≤16.0 MB

Training Techniques

Architecture
XSA
Expanded XSA from the last 4 layers to all 11 layers.
parameters: {"layers":11}
LeakyReLU
Uses a squared LeakyReLU activation (negative slope 0.5), written LeakyReLU(0.5)^2, in the MLP.
parameters: {"slope":0.5}
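For illustration, the activation can be sketched in plain Python. This reads LeakyReLU(0.5)^2 as "apply LeakyReLU with negative slope 0.5, then square the result"; the function name `leaky_relu_squared` is hypothetical and not taken from the PR.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y


# Negative inputs are scaled by the slope before squaring, so they still
# contribute a (smaller) non-negative activation instead of being zeroed.
values = [leaky_relu_squared(v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
print(values)  # [1.0, 0.25, 0.0, 1.0, 4.0]
```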
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
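With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. The sketch below shows one common way to assign query heads to KV heads (consecutive grouping); the grouping order and the helper name `kv_head_for_query_head` are assumptions, since the PR summary only states the head counts.

```python
def kv_head_for_query_head(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index that a given query head attends with."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= q_head < heads:
        raise ValueError("q_head out of range")
    group_size = heads // kv_heads  # query heads per shared KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Sharing KV heads this way halves the key/value projection parameters and KV cache relative to full multi-head attention with 8 KV heads.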
BigramHash
Bigram hash embedding component used in the model.
parameters: {"size":"2816x112"}
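A minimal sketch of a bigram hash lookup, assuming the size "2816x112" means 2816 hash buckets with 112-dimensional embeddings. The mixing function, the constant 1000003, and all names here are hypothetical; the PR summary does not state which hash is used or how the embedding is combined with the token embedding.

```python
NUM_BUCKETS = 2816  # assumed: first number in "2816x112"
EMBED_DIM = 112     # assumed: second number in "2816x112"


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (previous token, current token) pair to a bucket index."""
    # Hypothetical mixing function; different pairs may collide by design.
    return (prev_token * 1000003 + token) % num_buckets


def bigram_embeddings(tokens, table):
    """Look up one embedding row per adjacent token pair."""
    return [table[bigram_bucket(p, t)] for p, t in zip(tokens, tokens[1:])]


# Zero-initialised table standing in for learned parameters.
table = [[0.0] * EMBED_DIM for _ in range(NUM_BUCKETS)]
tokens = [17, 523, 9001, 17, 523]
rows = bigram_embeddings(tokens, table)
print(len(rows), len(rows[0]))  # 4 112
```

The table size is fixed regardless of vocabulary size, which keeps the parameter count small at the cost of hash collisions between unrelated bigrams.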
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"embeddings_optimizer":"AdamW"}
Quantization
GPTQ
bits: 6
scope: all
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
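The windowing arithmetic behind a strided sliding-window evaluation can be sketched as follows. Only the stride of 64 comes from the PR summary; the window length of 256 in the example and the function name `sliding_windows` are assumptions for illustration.

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Return (begin, end, n_scored) spans covering n_tokens exactly once.

    Each window feeds tokens [begin, end) to the model, but only the last
    n_scored positions are counted, so every token is scored once and, after
    the first window, with at least window - stride tokens of context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans


spans = sliding_windows(n_tokens=400, window=256, stride=64)
print(spans)  # [(0, 256, 256), (64, 320, 64), (128, 384, 64), (192, 400, 16)]
```

A smaller stride gives each scored token more left context but requires more forward passes over the validation set.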
Test-Time Training
score-first TTT
parameters: {"epochs":3}
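The ordering implied by "score-first" can be illustrated with a toy model: each chunk is scored with the current weights, and only afterwards is the model trained on that chunk for 3 epochs, so no token is scored by a model that has already seen it. This is a sketch under stated assumptions: the unigram softmax model, the plain gradient-descent update (standing in for Muon), the learning rate of 0.5, and the chunking are all hypothetical and not taken from the PR.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_nll(logits, chunk):
    """Summed negative log-likelihood of a chunk under a unigram model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in chunk)


def train_epoch(logits, chunk, lr):
    """One gradient step on the chunk's mean NLL (gradient = probs - freqs)."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    for t in chunk:
        counts[t] += 1
    n = len(chunk)
    return [w - lr * (p - c / n) for w, p, c in zip(logits, probs, counts)]


def score_first_ttt(chunks, vocab_size, epochs=3, lr=0.5):
    """Return mean NLL per token, scoring each chunk before adapting on it."""
    logits = [0.0] * vocab_size
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += chunk_nll(logits, chunk)  # 1) score with current weights
        total_tokens += len(chunk)
        for _ in range(epochs):                # 2) then adapt on the same chunk
            logits = train_epoch(logits, chunk, lr)
    return total_nll / total_tokens


single = score_first_ttt([[0, 0, 1, 0]], vocab_size=4)
repeated = score_first_ttt([[0, 0, 1, 0]] * 5, vocab_size=4)
print(f"{single:.4f} {repeated:.4f}")
```

With a single chunk the score equals that of the unadapted model, because adaptation happens only after scoring; later chunks benefit from updates made on earlier ones.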
Other
SLOT
Eval-time adaptation that optimizes a learned additive delta vector at the last hidden layer.
parameters: {"delta_dim":512,"steps":5,"learning_rate":0.003}
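The delta optimization can be sketched on a toy problem: a single vector is added to every last-layer hidden state, the output head stays frozen, and the vector is updated for 5 steps at learning rate 0.003. Several pieces are assumptions made for illustration: plain gradient descent as the update rule, the tiny dimensions (8 instead of 512), the random data, and optimizing and measuring on the same tokens. The PR summary does not specify the optimizer or which tokens the delta is fitted on.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll_and_grad(hidden, targets, head, delta):
    """Mean cross-entropy of head @ (h + delta) and its gradient w.r.t. delta."""
    dim = len(delta)
    grad = [0.0] * dim
    nll = 0.0
    for h, y in zip(hidden, targets):
        shifted = [hv + dv for hv, dv in zip(h, delta)]
        logits = [sum(w * s for w, s in zip(row, shifted)) for row in head]
        probs = softmax(logits)
        nll -= math.log(probs[y])
        # d(loss)/d(delta) = head^T @ (probs - one_hot(y))
        for v, row in enumerate(head):
            err = probs[v] - (1.0 if v == y else 0.0)
            for j in range(dim):
                grad[j] += err * row[j]
    n = len(hidden)
    return nll / n, [g / n for g in grad]


def slot_adapt(hidden, targets, head, steps=5, lr=0.003):
    """Fit one additive delta vector; model weights (head) are never updated."""
    delta = [0.0] * len(hidden[0])
    for _ in range(steps):
        _, grad = mean_nll_and_grad(hidden, targets, head, delta)
        delta = [dv - lr * g for dv, g in zip(delta, grad)]
    return delta


# Toy data standing in for frozen-model hidden states and an output head.
random.seed(0)
DIM, VOCAB, TOKENS = 8, 5, 16
head = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]
hidden = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TOKENS)]
targets = [random.randrange(VOCAB) for _ in range(TOKENS)]

before, _ = mean_nll_and_grad(hidden, targets, head, [0.0] * DIM)
delta = slot_adapt(hidden, targets, head)
after, _ = mean_nll_and_grad(hidden, targets, head, delta)
print(f"mean NLL before: {before:.4f}, after: {after:.4f}")
```

Because only the delta vector is trained, the adaptation adds no parameters to the stored artifact and its cost scales with the hidden width rather than the model size.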

Novel Contributions

  • QK_GAIN_INIT increased to 4.0
  • XSA expanded to all 11 layers
  • Muon-TTT enabled in score-first mode
  • SLOT eval-time delta optimization at the last hidden layer
  • Combined 3-seed mean val_bpb of 1.0962