PR #1176
openRecord: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0962 (3-seed mean)
by bigbag
val_bpb: 1.0962
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ≤16.0 MB
Training Techniques
Architecture
XSA
Expanded XSA from the last 4 layers to all 11 layers.
parameters: {"layers":11}
LeakyReLU
Uses a squared LeakyReLU activation, LeakyReLU(x; slope=0.5)^2, in the MLP.
parameters: {"slope":0.5}
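The activation can be sketched directly; only the 0.5 slope comes from the record, the function name is illustrative:

```python
def leaky_relu_sq(x, slope=0.5):
    # Squared LeakyReLU: apply LeakyReLU with negative slope 0.5, then square.
    # The square makes the output non-negative on both branches.
    y = x if x >= 0 else slope * x
    return y * y
```

Relative to the more common ReLU^2, the 0.5 slope keeps a nonzero (linearly shrinking) gradient for negative pre-activations, since d/dx (0.5x)^2 = 0.5x.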
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
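A shape-level sketch of the 8-head / 4-KV-head grouping; head dimension and sequence length here are arbitrary stand-ins:

```python
import numpy as np

n_head, n_kv, d_head, T = 8, 4, 16, 4   # 8 query heads share 4 KV heads
group = n_head // n_kv                  # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_head, T, d_head))
k = rng.standard_normal((n_kv, T, d_head))
v = rng.standard_normal((n_kv, T, d_head))

# Materialize the sharing by repeating each KV head for its group.
k_full = np.repeat(k, group, axis=0)    # (8, T, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full                  # same output shape as full MHA
```

The KV cache and KV projection parameters shrink by 2x versus 8 full KV heads; `np.repeat` is for clarity only, real kernels index the shared heads rather than copying them.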
BigramHash
Hashed bigram embedding component with a 2816x112 table.
parameters: {"size":"2816x112"}
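A minimal sketch of a hashed bigram embedding lookup. The 2816x112 table size is from the record; the hash function and the zero-sentinel for position 0 are assumptions:

```python
import numpy as np

ROWS, DIM = 2816, 112          # table size "2816x112" from the record
table = np.zeros((ROWS, DIM))  # learned rows in the real model

def bigram_row(prev_tok, tok):
    # Hash the (previous, current) token pair into one of 2816 rows.
    # The multiplier is an arbitrary mixing prime; the record does not
    # specify the actual hash function.
    return (prev_tok * 1_000_003 + tok) % ROWS

def bigram_embed(tokens):
    # One 112-dim row per position; position 0 pairs with a 0 sentinel.
    rows = [bigram_row(tokens[i - 1] if i > 0 else 0, t)
            for i, t in enumerate(tokens)]
    return table[rows]          # (len(tokens), 112)
```

Hashing lets a small table cover all token pairs at the cost of collisions, which fits the ≤16 MB artifact budget better than a dense vocab-squared bigram table would.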
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"embeddings_optimizer":"AdamW"}
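The core Muon step maps the momentum-averaged gradient matrix to a nearby approximately orthogonal matrix. This sketch uses the plain cubic Newton-Schulz iteration; production Muon uses a tuned quintic needing far fewer iterations, and "Parallel" refers to sharding this work across devices. Embedding parameters stay on AdamW, per the record:

```python
import numpy as np

def orthogonalize(g, iters=15):
    # Normalize by the Frobenius norm so all singular values lie in
    # (0, 1], which guarantees the iteration converges.
    x = g / np.linalg.norm(g)
    for _ in range(iters):
        # Cubic Newton-Schulz step: pushes every singular value toward 1
        # while keeping the singular vectors fixed.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Orthogonalizing the update equalizes the step size across all directions of the weight matrix instead of letting a few dominant singular directions absorb most of the learning.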
Quantization
GPTQ
bits: 6
scope: all
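GPTQ proper picks roundings column-by-column using second-order (Hessian) error information; as a simpler stand-in, here is plain round-to-nearest 6-bit quantization showing the bit budget the record actually fixes:

```python
import numpy as np

def quantize_6bit(w):
    # Round-to-nearest asymmetric quantization: 2**6 = 64 levels per tensor.
    qmax = 2 ** 6 - 1
    scale = (w.max() - w.min()) / qmax
    zero = w.min()
    q = np.round((w - zero) / scale).astype(np.int64)
    return q, scale, zero

def dequantize_6bit(q, scale, zero):
    # Reconstruction error is at most scale / 2 per weight.
    return q * scale + zero
```

At 6 bits a weight costs 0.75 bytes before the zstd pass, which is what makes the ≤16 MB artifact budget reachable.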
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
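With stride 64, each token is scored exactly once while seeing long left context. The context length below is an assumption; the record only fixes the stride:

```python
def sliding_eval_spans(n_tokens, context=256, stride=64):
    # Returns (window_start, score_start, score_end) triples: the model
    # sees [window_start, score_end) but only [score_start, score_end)
    # is scored, so every token past the first window is evaluated with
    # at least context - stride tokens of history.
    spans, pos = [], 0
    while pos < n_tokens:
        if pos == 0:
            span = (0, 0, min(context, n_tokens))
        else:
            span = (max(0, pos + stride - context), pos,
                    min(pos + stride, n_tokens))
        spans.append(span)
        pos = span[2]
    return spans
```

A smaller stride lowers measured bpb (more context per scored token) at the cost of proportionally more forward passes.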
Test-Time Training
score-first TTT
parameters: {"epochs":3}
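A minimal sketch of the "score-first" ordering: each eval chunk is scored with the current weights before any test-time update on that chunk. Treating epochs=3 as per-chunk update passes is this sketch's assumption:

```python
def score_first_ttt(chunks, score, update, epochs=3):
    # Score-first ordering: evaluate each chunk BEFORE adapting on it,
    # so val_bpb never reflects training on the chunk being scored.
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # score first...
        for _ in range(epochs):       # ...then take TTT updates
            update(chunk)
    return losses
```

The ordering matters for the validity of the metric: score-first TTT improves later chunks only through adaptation on earlier, already-scored data.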
Other
SLOT
SLOT eval-time adaptation: a learned additive delta vector is optimized at the last hidden layer during evaluation.
parameters: {"delta_dim":512,"steps":5,"learning_rate":0.003}
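A toy version of the SLOT loop using the record's hyperparameters (delta_dim=512, 5 steps, lr 3e-3). The unembedding matrix, vocab size, and adaptation target are stand-ins; only the delta placement and hyperparameters come from the record:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 512, 64                           # delta_dim=512 per the record; toy vocab
W = rng.standard_normal((V, D)) * 0.02   # stand-in unembedding matrix
h = rng.standard_normal(D)               # frozen last-layer hidden state
target = 3                               # stand-in adaptation target token

delta = np.zeros(D)                      # the learned additive delta
lr, steps = 3e-3, 5                      # from the record's parameters
losses = []
for _ in range(steps):
    logits = W @ (h + delta)             # delta added at the last hidden layer
    p = np.exp(logits - logits.max())
    p /= p.sum()
    losses.append(-np.log(p[target]))
    grad = p.copy()
    grad[target] -= 1.0                  # d(cross-entropy)/d(logits)
    delta -= lr * (W.T @ grad)           # update only delta; model stays frozen
```

Because only a 512-dim vector is optimized for 5 steps, the adaptation cost per evaluation sample is tiny compared with full test-time training.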
Novel Contributions
- QK_GAIN_INIT increased to 4.0
- XSA expanded to all 11 layers
- Muon-TTT enabled in score-first mode
- SLOT eval-time delta optimization at the last hidden layer
- Combined 3-seed mean val_bpb of 1.0962
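The QK-gain contribution can be pictured as follows. The assumed form (not stated in the record) is a learnable scalar scaling the logits of qk-normalized attention; only the 4.0 init value comes from the record:

```python
import numpy as np

QK_GAIN_INIT = 4.0   # raised to 4.0 in this record

def qk_gain_logits(q, k, gain=QK_GAIN_INIT):
    # L2-normalize q and k per head (qk-norm), then scale the resulting
    # cosine-similarity logits by the learnable gain.
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return gain * (qn @ kn.T)
```

With unit-norm q and k every raw logit is a cosine in [-1, 1], so the gain acts as an inverse softmax temperature: initializing it at 4.0 starts attention sharper than an init of 1.0 would.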