PR #940

open

Record: Score-First TTT + Multi-Order N-gram Backoff (3-seed mean val_bpb=0.9581)

by antaloaalonso
val_bpb: 0.9581
Architecture: Transformer
Optimizer:
Artifact Size: 15.7 MB

Training Techniques

Test-Time Training
  • score-first TTT (parameters: none)
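The score-first mechanic can be sketched as follows. This is an assumed reconstruction, not the PR's code: a real run would score under `torch.inference_mode()` with the actual transformer, while here a toy unigram model stands in, so only the score-then-train ordering is the point.

```python
import math

def score_first_ttt(chunks, vocab_size=256, lr=0.1):
    """Score-first test-time training sketch (assumed mechanics): each
    chunk is scored with frozen parameters first, then the model trains
    on that same chunk, so no token is ever scored by a model that has
    already trained on it."""
    # toy stand-in "model": a unigram logit vector updated by SGD on cross-entropy
    logits = [0.0] * vocab_size
    total_bits, total_tokens = 0.0, 0

    def probs():
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    for chunk in chunks:
        # 1) score pass: frozen parameters (the real code would wrap this
        #    in torch.inference_mode() so no gradients are recorded)
        p = probs()
        for tok in chunk:
            total_bits += -math.log2(p[tok])
        total_tokens += len(chunk)

        # 2) train pass: one SGD step on the chunk that was just scored
        p = probs()
        counts = [0] * vocab_size
        for tok in chunk:
            counts[tok] += 1
        for v in range(vocab_size):
            grad = p[v] - counts[v] / len(chunk)  # dCE/dlogit for a softmax
            logits[v] -= lr * grad

    return total_bits / total_tokens  # bits per token (bpb if tokens are bytes)
```

On repetitive data the reported average drops below the uniform 8 bits/byte because every chunk after the first is scored by a model adapted to the preceding chunks.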
Other
  • Multi-order n-gram backoff cache using orders 2-7 with entropy-adaptive alpha mixing (parameters: orders 2-7)
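A minimal sketch of how such a cache could work. The count structure, the longest-context-first backoff, and the `alpha = H(model) / H_max` mixing form are all assumptions; the PR only records the orders 2-7 and the fact that alpha adapts to entropy.

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Multi-order n-gram cache sketch (assumed design). Prediction backs
    off from the highest order (7) to the lowest (2): the longest context
    that has been seen before supplies the cache distribution."""

    def __init__(self, orders=range(2, 8), vocab_size=256):
        self.orders = sorted(orders, reverse=True)  # try longest context first
        self.vocab_size = vocab_size
        # counts[n][context_tuple][next_byte] -> count; order n uses n-1 context bytes
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}
        self.history = []

    def update(self, byte):
        for n in self.orders:
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                self.counts[n][ctx][byte] += 1
        self.history.append(byte)

    def cache_probs(self):
        for n in self.orders:  # back off: longest matching context wins
            if len(self.history) >= n - 1:
                ctx = tuple(self.history[-(n - 1):])
                bucket = self.counts[n].get(ctx)
                if bucket:
                    total = sum(bucket.values())
                    return [bucket.get(v, 0) / total for v in range(self.vocab_size)]
        return None  # no context matched; fall back to the model alone

def mix(model_probs, cache_probs):
    """Entropy-adaptive alpha mixing (assumed form): the more uncertain
    the model, the more weight the cache distribution gets."""
    if cache_probs is None:
        return model_probs
    h = -sum(p * math.log2(p) for p in model_probs if p > 0)
    alpha = h / math.log2(len(model_probs))  # in [0, 1]
    return [(1 - alpha) * m + alpha * c for m, c in zip(model_probs, cache_probs)]
```

The cache is backward-looking: it only counts bytes already emitted, which is what makes it compatible with the score-first ordering.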
Architecture
  • GQA: grouped query attention with 8 attention heads and 4 KV heads (heads: 8, kv_heads: 4)
  • MLP3x: MLP width expanded to 3x
  • U-Net skip connections: U-Net-style skip connections in the transformer
  • LeakyReLU: LeakyReLU(0.5)^2 activation (negative_slope: 0.5)
  • XSA: exclusive self-attention applied to all layers (layers: 11)
  • Value Residual: layer-0 value output mixed into subsequent layers via learned sigmoid gates
  • Gated Attention: per-head sigmoid gates on attention output
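Of the architecture entries above, the GQA head grouping is easy to make concrete. A pure-Python sketch of the assumed mechanics: with 8 query heads and 4 KV heads, query head `h` reads KV head `h // 2` (the conventional contiguous grouping, not confirmed by the PR).

```python
import math

def gqa(q_heads, kv_heads_k, kv_heads_v):
    """Grouped-query attention sketch. q_heads: [n_q][T][d] query vectors;
    kv_heads_k / kv_heads_v: [n_kv][T][d]. Each KV head serves
    n_q // n_kv query heads (2 for the 8/4 split above). Causal."""
    n_q, n_kv = len(q_heads), len(kv_heads_k)
    group = n_q // n_kv  # query heads per KV head
    out = []
    for h in range(n_q):
        k, v = kv_heads_k[h // group], kv_heads_v[h // group]  # shared KV
        d = len(q_heads[h][0])
        head_out = []
        for t, qv in enumerate(q_heads[h]):
            # scaled dot-product scores over positions 0..t (causal mask)
            scores = [sum(a * b for a, b in zip(qv, k[s])) / math.sqrt(d)
                      for s in range(t + 1)]
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            head_out.append([sum(w[s] * v[s][i] for s in range(t + 1)) / z
                             for i in range(d)])
        out.append(head_out)
    return out
```

Halving the KV heads halves the KV cache and the K/V projection parameters, which matters under the 15.7 MB artifact budget.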
Weight Averaging
  • EMA (decay: 0.997)
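The EMA entry is the standard exponential-moving-average update; with decay 0.997 the effective averaging horizon is roughly 1 / (1 - 0.997) ≈ 333 steps. A one-function sketch over plain float lists (the real version would iterate model tensors):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: evaluation weights track the training weights with
    exponential decay 0.997, smoothing out late-training noise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```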
LR Schedule
  • warmdown (warmdown_steps: 3000)
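The PR only records `warmdown_steps: 3000`; a common shape for such a schedule, assumed here, is a constant learning rate followed by linear decay to zero over the final 3,000 steps:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3000):
    """Warmdown schedule sketch (assumed shape): flat at base_lr until the
    last `warmdown_steps` steps, then linear decay to zero."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```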
Quantization
  • int6 (bits: 6, scope: per-row)
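Per-row int6 plausibly means one scale per weight row with values mapped into a signed 6-bit range; the symmetric [-31, 31] mapping below (leaving -32 unused) is an assumption, since the record only states bits and scope:

```python
def quantize_int6_per_row(rows):
    """Per-row symmetric int6 quantization sketch: each row gets its own
    scale so its values round into [-31, 31]; the scale is stored alongside
    the quantized row for dequantization at load time."""
    out = []
    for row in rows:
        scale = max(abs(x) for x in row) / 31 or 1.0  # guard all-zero rows
        q = [max(-31, min(31, round(x / scale))) for x in row]
        out.append((scale, q))
    return out

def dequantize(packed):
    return [[scale * v for v in q] for scale, q in packed]
```

Per-row scales keep the rounding error bounded by half a quantization step of each row's own magnitude, which is what makes 6-bit storage viable for the weight artifact before zstd compression.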
Compression
  • zstd (level: 16)

Novel Contributions

  • Score-first test-time training that scores tokens under inference_mode before training on them
  • Multi-order n-gram backoff cache with entropy-adaptive alpha mixing
  • Combination of score-first TTT with backward-looking n-gram cache under competition compliance constraints
  • 11-layer transformer with XSA on all layers, LeakyReLU(0.5)^2, Value Residual, and Gated Attention