PR #940

open

Record: Score-First TTT + Multi-Order N-gram Backoff (3-seed mean val_bpb=0.9581)

by antaloaalonso
val_bpb: 0.9581
Architecture: Transformer
Optimizer:
Artifact Size: 15.7 MB

Training Techniques

Test-Time Training
  • score-first TTT (parameters: none)
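The score-first mechanic can be sketched as follows. This is an assumed reconstruction, not the PR's code: a real run would score under `torch.inference_mode()` with the actual transformer, while here a toy unigram model stands in, so only the score-then-train ordering is the point.

```python
import math

def score_first_ttt(chunks, vocab_size=256, lr=0.1):
    """Score-first test-time training sketch (assumed mechanics): each
    chunk is scored with frozen parameters first, then the model trains
    on that same chunk, so no token is ever scored by a model that has
    already trained on it."""
    # toy stand-in "model": a unigram logit vector updated by SGD on cross-entropy
    logits = [0.0] * vocab_size
    total_bits, total_tokens = 0.0, 0

    def probs():
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    for chunk in chunks:
        # 1) score pass: frozen parameters (the real code would wrap this
        #    in torch.inference_mode() so no gradients are recorded)
        p = probs()
        for tok in chunk:
            total_bits += -math.log2(p[tok])
        total_tokens += len(chunk)

        # 2) train pass: one SGD step on the chunk that was just scored
        p = probs()
        counts = [0] * vocab_size
        for tok in chunk:
            counts[tok] += 1
        for v in range(vocab_size):
            grad = p[v] - counts[v] / len(chunk)  # dCE/dlogit for a softmax
            logits[v] -= lr * grad

    return total_bits / total_tokens  # bits per token (bpb if tokens are bytes)
```

On repetitive data the reported average drops below the uniform 8 bits/byte because every chunk after the first is scored by a model adapted to the preceding chunks.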
Other
  • Multi-order n-gram backoff cache using orders 2-7 with entropy-adaptive alpha mixing (parameters: orders 2-7)
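A minimal sketch of how such a cache could work. The count structure, the longest-context-first backoff, and the `alpha = H(model) / H_max` mixing form are all assumptions; the PR only records the orders 2-7 and the fact that alpha adapts to entropy.

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Multi-order n-gram cache sketch (assumed design). Prediction backs
    off from the highest order (7) to the lowest (2): the longest context
    that has been seen before supplies the cache distribution."""

    def __init__(self, orders=range(2, 8), vocab_size=256):
        self.orders = sorted(orders, reverse=True)  # try longest context first
        self.vocab_size = vocab_size
        # counts[n][context_tuple][next_byte] -> count; order n uses n-1 context bytes
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}
        self.history = []

    def update(self, byte):
        for n in self.orders:
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                self.counts[n][ctx][byte] += 1
        self.history.append(byte)

    def cache_probs(self):
        for n in self.orders:  # back off: longest matching context wins
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                bucket = self.counts[n].get(ctx)
                if bucket:
                    total = sum(bucket.values())
                    return [bucket.get(v, 0) / total for v in range(self.vocab_size)]
        return None  # no context matched; fall back to the model alone

def mix(model_probs, cache_probs):
    """Entropy-adaptive alpha mixing (assumed form): the more uncertain
    the model, the more weight the cache distribution gets."""
    if cache_probs is None:
        return model_probs
    h = -sum(p * math.log2(p) for p in model_probs if p > 0)
    alpha = h / math.log2(len(model_probs))  # in [0, 1]
    return [(1 - alpha) * m + alpha * c for m, c in zip(model_probs, cache_probs)]
```

The cache is backward-looking: it only counts bytes already emitted, which is what makes it compatible with the score-first ordering.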
Architecture
  • GQA: grouped query attention with 8 attention heads and 4 KV heads (heads: 8, kv_heads: 4)
  • MLP3x: MLP width expanded to 3x
  • U-Net skip connections: U-Net-style skip connections in the transformer
  • LeakyReLU: LeakyReLU(0.5)^2 activation (negative_slope: 0.5)
  • XSA: exclusive self-attention applied to all layers (layers: 11)
  • Value Residual: layer-0 value output mixed into subsequent layers via learned sigmoid gates
  • Gated Attention: per-head sigmoid gates on attention output
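Of the architecture entries above, the GQA head grouping is easy to make concrete. A pure-Python sketch of the assumed mechanics: with 8 query heads and 4 KV heads, query head `h` reads KV head `h // 2` (the conventional contiguous grouping, not confirmed by the PR).

```python
import math

def gqa(q_heads, kv_heads_k, kv_heads_v):
    """Grouped-query attention sketch. q_heads: [n_q][T][d] query vectors;
    kv_heads_k / kv_heads_v: [n_kv][T][d]. Each KV head serves
    n_q // n_kv query heads (2 for the 8/4 split above). Causal."""
    n_q, n_kv = len(q_heads), len(kv_heads_k)
    group = n_q // n_kv  # query heads per KV head
    out = []
    for h in range(n_q):
        k, v = kv_heads_k[h // group], kv_heads_v[h // group]  # shared KV
        d = len(q_heads[h][0])
        head_out = []
        for t, qv in enumerate(q_heads[h]):
            # scaled dot-product scores over positions 0..t (causal mask)
            scores = [sum(a * b for a, b in zip(qv, k[s])) / math.sqrt(d)
                      for s in range(t + 1)]
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            head_out.append([sum(w[s] * v[s][i] for s in range(t + 1)) / z
                             for i in range(d)])
        out.append(head_out)
    return out
```

Halving the KV heads halves the KV cache and the K/V projection parameters, which matters under the 15.7 MB artifact budget.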
Weight Averaging
  • EMA (decay: 0.997)
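The EMA entry is the standard exponential-moving-average update; with decay 0.997 the effective averaging horizon is roughly 1 / (1 - 0.997) ≈ 333 steps. A one-function sketch over plain float lists (the real version would iterate model tensors):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: evaluation weights track the training weights with
    exponential decay 0.997, smoothing out late-training noise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```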
LR Schedule
  • warmdown (warmdown_steps: 3000)
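The PR only records `warmdown_steps: 3000`; a common shape for such a schedule, assumed here, is a constant learning rate followed by linear decay to zero over the final 3,000 steps:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Warmdown schedule sketch (assumed shape): flat at base_lr until the
    last `warmdown_steps` steps, then linear decay to zero."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```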
Quantization
  • int6 (bits: 6, scope: per-row)
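Per-row int6 plausibly means one scale per weight row with values mapped into a signed 6-bit range; the symmetric [-31, 31] mapping below (leaving -32 unused) is an assumption, since the record only states bits and scope:

```python
def quantize_int6_per_row(rows):
    """Per-row symmetric int6 quantization sketch: each row gets its own
    scale so its values round into [-31, 31]; the scale is stored alongside
    the quantized row for dequantization at load time."""
    out = []
    for row in rows:
        scale = max(abs(x) for x in row) / 31 or 1.0  # guard all-zero rows
        q = [max(-31, min(31, round(x / scale))) for x in row]
        out.append((scale, q))
    return out

def dequantize(packed):
    return [[scale * v for v in q] for scale, q in packed]
```

Per-row scales keep the rounding error bounded by half a quantization step of each row's own magnitude, which is what makes 6-bit storage viable for the weight artifact before zstd compression.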
Compression
  • zstd (level: 16)

Novel Contributions

  • Score-first test-time training that scores tokens under inference_mode before training on them
  • Multi-order n-gram backoff cache with entropy-adaptive alpha mixing
  • Combination of score-first TTT with backward-looking n-gram cache under competition compliance constraints
  • 11-layer transformer with XSA on all layers, LeakyReLU(0.5)^2, Value Residual, and Gated Attention