PR #883

open

Order-13 N-gram Oracle + Score-First TTT (0.0308 BPB)

val_bpb

0.0308

Architecture

Transformer

Optimizer

AdamW

Artifact Size

3.66 MB

Training Techniques

Architecture

BackoffNgramMixer

GPU-vectorized logistic context mixer combining neural logits with order-2 through order-13 n-gram backoff probabilities.

parameters: {"max_order":13,"experts":13}
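The PR does not show the mixer's internals, so the following is only a minimal sketch of one way to combine expert predictions in the log domain. The function name `mix_log_linear`, the fixed weights, and the three-token vocabulary are illustrative assumptions; the actual BackoffNgramMixer is GPU-vectorized and presumably learns its weights.

```python
import math

def mix_log_linear(expert_probs, weights, eps=1e-9):
    """Combine expert distributions over the same vocabulary.

    Hypothetical helper: takes a weighted sum of log-probabilities
    per token, then renormalises with a softmax.
    """
    vocab_size = len(expert_probs[0])
    scores = [
        sum(w * math.log(max(p[v], eps)) for w, p in zip(weights, expert_probs))
        for v in range(vocab_size)
    ]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Toy inputs: one "neural" distribution and one "n-gram" distribution.
neural = [0.5, 0.3, 0.2]
ngram = [0.9, 0.05, 0.05]
mixed = mix_log_linear([neural, ngram], weights=[0.5, 0.5])
```

When both experts favour the same token, the mixed distribution is sharper on that token than the neural distribution alone, which is the effect a mixer of this kind relies on.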

Test-Time Training

score-first TTT

parameters: {"phases":2}
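The ordering constraint behind score-first TTT can be sketched with a toy model. Everything here is an assumption for illustration: `score_first_ttt`, `unigram_nll`, and `unigram_update` are hypothetical names, a smoothed unigram counter stands in for the Transformer, and the two-phase schedule from the parameters above is not modeled.

```python
import math

def score_first_ttt(chunks, state, score_fn, update_fn):
    """Score each chunk with the current state, then adapt on it."""
    total_nll = 0.0
    total_tokens = 0
    for chunk in chunks:
        # 1) Score the whole chunk before the model has seen it.
        total_nll += score_fn(state, chunk)
        total_tokens += len(chunk)
        # 2) Only now adapt on the chunk that was just scored.
        state = update_fn(state, chunk)
    return total_nll / max(total_tokens, 1), state

def unigram_nll(counts, chunk):
    """Negative log-likelihood (nats) under an add-one-smoothed unigram model."""
    total = sum(counts) + len(counts)
    return -sum(math.log((counts[t] + 1) / total) for t in chunk)

def unigram_update(counts, chunk):
    """Toy stand-in for a weight update: add the chunk's token counts."""
    new_counts = list(counts)
    for t in chunk:
        new_counts[t] += 1
    return new_counts

avg_nll, final_counts = score_first_ttt(
    [[0, 0], [0, 0]], [0, 0], unigram_nll, unigram_update
)
```

Because scoring always precedes the update, no token contributes to its own score, and later chunks still benefit from adaptation on earlier ones.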

Evaluation

sliding window eval

parameters: null
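The PR lists no parameters for the sliding window evaluation, so the window and stride values below are placeholders. As a rough sketch of the usual scheme, each window supplies context, but only the tokens not covered by a previous window are scored. `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) triples.

    The model sees tokens [start, end) as context, and only tokens
    in [score_from, end) count toward the metric, so each token is
    scored exactly once.
    """
    assert 0 < stride <= window, "stride must not exceed the window"
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        if end == n_tokens:
            break

spans = list(sliding_windows(n_tokens=10, window=4, stride=2))
```

A smaller stride gives each scored token more left context, at the cost of more forward passes.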

Quantization

int6

bits: 6

scope: final artifact
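The PR gives only the bit width and scope, so the scheme below is an assumption: a symmetric, per-tensor quantizer that maps floats to signed 6-bit integers in [-32, 31]. `quantize_int6` and `dequantize` are hypothetical names, and the actual artifact may use per-row scales or a different packing.

```python
def quantize_int6(values):
    """Symmetric per-tensor quantization to signed 6-bit integers."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 31 so that +max and -max are both representable.
    scale = max_abs / 31 if max_abs > 0 else 1.0
    quantized = [max(-32, min(31, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int6(weights)
restored = dequantize(q, scale)
```

Each restored value lies within half a quantization step (`scale / 2`) of the original, which is the rounding error such a scheme trades for the smaller artifact.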

Compression

custom

level: null

Pre-filling order-2 through order-13 n-gram tables from the full training set before the training loop
Score-first test-time training where each validation chunk is fully scored before any weight updates
A pretrained n-gram oracle passed into evaluation to eliminate cold-start behavior
GPU-vectorized backoff n-gram mixer combining neural and n-gram predictions
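The pre-filling and backoff steps listed above can be sketched on the CPU with plain dictionaries. This is an illustration only: `build_tables` and `backoff_prob` are hypothetical names, the backoff rule (use the highest order whose context was seen, with no smoothing) is an assumption, and the PR's version is GPU-vectorized.

```python
from collections import defaultdict

def build_tables(tokens, min_order=2, max_order=13):
    """Count next-token occurrences for every context of length order-1."""
    tables = {
        n: defaultdict(lambda: defaultdict(int))
        for n in range(min_order, max_order + 1)
    }
    for n in range(min_order, max_order + 1):
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            tables[n][context][tokens[i]] += 1
    return tables

def backoff_prob(tables, history, token, min_order=2, max_order=13):
    """Return (order, probability) from the highest order with a seen context."""
    for n in range(max_order, min_order - 1, -1):
        if len(history) < n - 1:
            continue  # not enough history for this order
        counts = tables[n].get(tuple(history[-(n - 1):]))
        if counts:
            return n, counts.get(token, 0) / sum(counts.values())
    return None, None  # no order has seen this context

# Tables are filled once from training tokens, before any evaluation.
train_tokens = [1, 2, 3, 1, 2, 3, 1, 2, 4]
tables = build_tables(train_tokens, max_order=3)
order, prob = backoff_prob(tables, [1, 2], 3, max_order=3)
```

Filling the tables ahead of time means the first validation tokens already have n-gram statistics to draw on, which is the cold-start problem the pretrained oracle is meant to remove.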