val_bpb: 0.0308
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 3.66 MB
Training Techniques

Architecture: BackoffNgramMixer
A GPU-vectorized logistic context mixer combining neural logits with order-2 through order-13 n-gram backoff probabilities.
parameters: {"max_order": 13, "experts": 13}
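The mixing step can be sketched as below. The card does not specify the exact BackoffNgramMixer weighting or backoff rule, so the expert set (neural log-probabilities plus one distribution per n-gram order) and the log-domain weighted sum are assumptions, not the submission's actual code.

```python
import numpy as np

def mix_experts(neural_logits: np.ndarray,
                ngram_probs: np.ndarray,
                weights: np.ndarray) -> np.ndarray:
    """Sketch of a logistic context mixer over a vocabulary of size V.

    neural_logits: (V,)   raw logits from the transformer
    ngram_probs:   (E, V) one backoff distribution per n-gram order
    weights:       (E+1,) mixing weights, one per expert (assumed learned)
    """
    eps = 1e-9
    # Convert the neural logits to log-probabilities (log-softmax).
    m = neural_logits.max()
    neural_logp = neural_logits - (m + np.log(np.exp(neural_logits - m).sum()))
    # Stack all experts in log space: neural model first, then n-gram orders.
    expert_logp = np.vstack([neural_logp, np.log(ngram_probs + eps)])
    # Weighted sum of log-probabilities, renormalized to a distribution.
    mixed = weights @ expert_logp
    mixed -= mixed.max()
    p = np.exp(mixed)
    return p / p.sum()
```

In a real mixer the weights would be trained on the log-loss of the mixed distribution; here they are just a fixed vector for illustration.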
Test-Time Training: score-first TTT
parameters: {"phases": 2}

Evaluation: sliding window eval
Quantization: int6 (bits: 6, scope: final artifact)
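A minimal int6 round-trip might look like the following. The card only states bits=6 applied to the final artifact; the symmetric per-tensor scheme, the clipping range, and the int8 storage type are all assumptions for illustration.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor 6-bit quantization (assumed scheme).

    Values are mapped to integers in [-31, 31] with a single scale,
    stored in an int8 container since numpy has no 6-bit dtype.
    """
    qmax = 2 ** 5 - 1                      # 31: symmetric int6 range
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:                       # all-zero tensor edge case
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int6 codes."""
    return q.astype(np.float32) * scale
```

With round-to-nearest, the reconstruction error of any element is at most half the scale.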
Compression: custom
Novel Contributions
- Pre-filling order-2 through order-13 n-gram tables from the full training set before the training loop
- Score-first test-time training where each validation chunk is fully scored before any weight updates
- A pretrained n-gram oracle passed into evaluation to eliminate cold-start behavior
- GPU-vectorized backoff n-gram mixer combining neural and n-gram predictions
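The score-first loop from the second bullet can be sketched with a toy adaptive model. The real submission's model, update rule, and two-phase structure ({"phases": 2}) are not detailed in the card, so the count-based byte model below is purely illustrative; only the ordering (score the whole chunk, then update) reflects the stated technique.

```python
import math

def score_first_ttt(score, update, chunks):
    """Score each chunk fully BEFORE any weight update on it, so the
    model is never billed for bytes it has already trained on."""
    total_bits = 0.0
    for chunk in chunks:
        total_bits += score(chunk)  # phase 1: score with current weights
        update(chunk)               # phase 2: adapt on the scored chunk
    return total_bits

# Toy adaptive byte model: Laplace-smoothed unigram counts (assumption).
counts = [1] * 256

def score(chunk: bytes) -> float:
    """Code length of the chunk in bits under the current counts."""
    total = sum(counts)
    return sum(-math.log2(counts[b] / total) for b in chunk)

def update(chunk: bytes) -> None:
    """Adapt the model on the chunk it has just been scored on."""
    for b in chunk:
        counts[b] += 1
```

Because scoring precedes updating, the first chunk is always coded under the untouched model (8 bits per byte here), and later chunks get cheaper as the model adapts.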