PR #1303

open

Record: SLOT + QK-Gain 4.0 + XSA-11 — val_bpb 0.9462 (3-seed mean)

by anthony-maio
val_bpb
0.9462
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7-15.8 MB

Training Techniques

Architecture
QK-Gain
Learned per-head scaling applied to attention queries.
parameters: {"version":4}
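The PR names QK-Gain but does not show where the gain enters; a minimal sketch, assuming the gain multiplies the queries before the dot product (the `version: 4` parameter's meaning is not specified):

```python
import numpy as np

def qk_gain_attention(q, k, v, gain):
    """Scaled dot-product attention with a per-head query gain.

    q, k, v: (heads, seq, dim) arrays; gain: (heads,) learned scale.
    Applying the gain to the queries is an assumption -- the PR only
    names the technique.
    """
    q = q * gain[:, None, None]                       # per-head query scaling
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                  # softmax over keys
    return w @ v
```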
XSA
Expanded XSA applied across all layers.
parameters: {"layers":11}
BigramHash
Bigram hash embeddings, with the hash table size reduced to fit the artifact size budget.
parameters: {"size":1024}
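A sketch of hashed-bigram embedding lookup with the PR's 1024-entry table; the hash mixing constant and zero-padding of the first position are illustrative assumptions:

```python
import numpy as np

TABLE_SIZE = 1024  # reduced hash size from the PR's parameters

def bigram_hash_embed(tokens, table):
    """Look up a hashed-bigram embedding for each position.

    tokens: (seq,) int array; table: (TABLE_SIZE, dim) embedding table.
    Hashes the (previous token, current token) pair into the table.
    """
    prev = np.concatenate(([0], tokens[:-1]))         # previous token; 0 pads position 0
    h = (prev * 1000003 + tokens) % TABLE_SIZE        # cheap bigram hash (illustrative)
    return table[h]                                   # (seq, dim)
```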
LeakyReLU
LeakyReLU-based MLP activation.
parameters: {"power":2}
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
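With 8 query heads over 4 KV heads, each KV head serves 8 // 4 = 2 query heads. A minimal numpy sketch of that sharing (the head-grouping order is the common convention, not confirmed by the PR):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: 8 query heads share 4 KV heads.

    q: (8, seq, dim); k, v: (4, seq, dim). Each KV head is repeated to
    serve a contiguous group of query heads.
    """
    group = q.shape[0] // k.shape[0]                  # 2 query heads per KV head
    k = np.repeat(k, group, axis=0)                   # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                  # softmax over keys
    return w @ v                                      # (8, seq, dim)
```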
Partial RoPE
Rotary positional embeddings applied to only a fraction of each head's dimensions.
parameters: {"train_fraction":16,"total_fraction":64}
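Reading the parameters as "rotate 16 of 64 head dimensions" is an assumption; under that reading, a sketch where the remaining dimensions pass through unrotated:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to only the first `rot_dims` of 64 head dims.

    x: (seq, 64) per-head activations. The split-halves rotation layout
    and frequency base are the standard RoPE convention, assumed here.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)         # (half,) rotation frequencies
    angles = np.arange(seq)[:, None] * freqs[None]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]         # paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```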
SmearGate
Learned gate that smears (blends) each token's representation with the previous token's.
parameters: null
U-Net skip connections
U-Net style skip connections in the transformer.
parameters: null
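A sketch of U-Net style skips over a transformer layer stack: first-half outputs are stashed and added back, last-in first-out, to second-half inputs. This pairing scheme is the common one and is not confirmed by the PR:

```python
import numpy as np

def unet_transformer_pass(x, layers):
    """Forward pass with U-Net style skip connections.

    `layers` is any list of callables x -> x. Outputs of the first half
    of the stack are added back (LIFO) before each second-half layer.
    """
    half = len(layers) // 2
    skips = []
    for layer in layers[:half]:
        x = layer(x)
        skips.append(x)                               # stash encoder-side activations
    for layer in layers[half:]:
        if skips:
            x = x + skips.pop()                       # fuse the matching skip
        x = layer(x)
    return x
```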
Test-Time Training
score-first TTT
parameters: {"steps":16,"learning_rate":0.008,"min_learning_rate":0.0008}
Evaluation
sliding window eval
parameters: {"stride":64}
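Sliding-window evaluation with stride 64 scores each token once while giving later tokens near-maximal left context. A sketch of the window bookkeeping; the 1024-token context size is an assumption, as the PR only states the stride:

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    """Enumerate (context_start, end, n_scored) evaluation windows.

    The first window scores all its tokens; each later window advances by
    `stride` and scores only the newly covered tokens, so every token is
    scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))    # tokens scored this window
        prev_end = end
        if end == n_tokens:
            break
    return spans
```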
Compression
lzma
level: null
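Since the compression level is listed as null, a minimal sketch of the artifact packing step using `lzma` defaults:

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    """Compress serialized model weights with LZMA.

    Preset/filters are unspecified in the PR, so stdlib defaults are used.
    """
    return lzma.compress(raw)
```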
Quantization
late QAT
bits: 6
scope: all
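"Late QAT" presumably means training finishes with fake quantization in the forward pass; the symmetric per-tensor scaling below is an assumption, with only the 6-bit width taken from the PR:

```python
import numpy as np

def fake_quant6(w):
    """Symmetric per-tensor fake quantization to 6 bits (levels -31..31).

    Rounds weights to the quantization grid and dequantizes, as in a
    QAT forward pass (the straight-through gradient is omitted here).
    """
    qmax = 2 ** (6 - 1) - 1                           # 31 for 6-bit signed
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                                  # dequantized weights
```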
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997}
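A sketch of the weight-averaging pieces: an EMA with the PR's decay of 0.997, plus one natural reading of "Tight SWA" as a plain mean over a short final window of checkpoints. How the two averages are combined is not specified:

```python
import numpy as np

class WeightEMA:
    """Exponential moving average of model weights (decay from the PR)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = [p.astype(float).copy() for p in params]

    def update(self, params):
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p               # shadow <- d*shadow + (1-d)*p

def tight_swa(checkpoints):
    """Average a short ("tight") window of late checkpoints, weight-wise."""
    return [np.mean(ws, axis=0) for ws in zip(*checkpoints)]
```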
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
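Muon applies momentum and then orthogonalizes the update via a Newton-Schulz iteration. A sketch with the quintic coefficients from the public Muon implementation; the lr and momentum values are assumptions, since the PR lists them as null (only weight_decay 0.04 is stated):

```python
import numpy as np

def orthogonalize_ns(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a matrix via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315                 # published Muon coefficients
    x = g / (np.linalg.norm(g) + eps)                 # normalize spectral scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                                       # iterate on the wide orientation
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * m @ m) @ x
    return x.T if transposed else x

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum-accumulate, orthogonalize, decayed step."""
    buf = momentum * buf + g
    update = orthogonalize_ns(momentum * buf + g)     # Nesterov-style lookahead
    w = w * (1 - lr * weight_decay) - lr * update     # decoupled weight decay
    return w, buf
```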
Regularization
LN scale
parameters: null

Novel Contributions

  • SLOT-16 scored-position learned output tuning with per-sample hidden delta and logit bias
  • QK-Gain 4.0 per-head query scaling
  • XSA expanded to all 11 layers
  • Improved sliding-window baseline combined with test-time SLOT optimization
  • Artifact fitting via reduced BigramHash size and lzma compression