PR #924

open

Order-16 Frozen N-gram Oracle + Score-First TTT (0.02801 BPB)

by THUQiXuan
val_bpb
0.0280
Architecture
Transformer
Optimizer
Muon
Artifact Size
12.8 MB

Training Techniques

Architecture
BigramHash
GPU-native multi-order backoff n-gram hashing tables for oracle predictions
parameters: {"orders":"2-16","buckets":4194304}
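A minimal sketch of a multi-order backoff n-gram table, assuming per-order hash tables over a fixed bucket space (bucket count from the PR's parameters; the FNV-style hash, the count-based buckets, and the longest-match-first backoff policy are illustrative, and a GPU implementation would use flat arrays rather than dicts):

```python
# Sketch of a multi-order backoff n-gram table: hash each context suffix of
# order 2..N into a fixed bucket space; at lookup, back off from the longest
# order to shorter ones until a populated bucket is found.
BUCKETS = 4_194_304  # 2**22, matching the PR's "buckets" parameter

def ngram_hash(context, order):
    """FNV-1a style hash of the last `order` tokens (hash choice illustrative)."""
    h = 0xCBF29CE484222325
    for tok in context[-order:]:
        h = ((h ^ tok) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h % BUCKETS

class BackoffNgramTable:
    def __init__(self, max_order=16):
        self.max_order = max_order
        # one mapping of bucket -> {next_token: count} per order
        self.tables = {k: {} for k in range(2, max_order + 1)}

    def update(self, context, next_token):
        """Prefill pass: count next_token under every context suffix order."""
        for k in range(2, min(self.max_order, len(context)) + 1):
            bucket = self.tables[k].setdefault(ngram_hash(context, k), {})
            bucket[next_token] = bucket.get(next_token, 0) + 1

    def predict(self, context):
        """Back off from the longest matching order down to order 2."""
        for k in range(min(self.max_order, len(context)), 1, -1):
            bucket = self.tables[k].get(ngram_hash(context, k))
            if bucket:
                total = sum(bucket.values())
                return {t: c / total for t, c in bucket.items()}
        return None  # no order matched this context
```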
LeakyReLU
LeakyReLU squared activation in the MLP
parameters: {"squared":true}
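The activation can be read as squaring the LeakyReLU output; a tiny sketch, noting that the slope value is an assumption and the PR does not state whether the sign is preserved after squaring:

```python
def leaky_relu_squared(x, negative_slope=0.01):
    # LeakyReLU then square: positive inputs become x**2, negative inputs
    # become (slope * x)**2 and stay near zero. Slope value is illustrative.
    y = x if x > 0 else negative_slope * x
    return y * y
```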
GQA
Grouped query attention
parameters: {"query_heads":8,"kv_heads":4}
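With 8 query heads over 4 KV heads, consecutive query heads share one key/value head, halving the KV cache. A sketch of the head grouping (function names are illustrative):

```python
def kv_head_index(q_head, n_q_heads=8, n_kv_heads=4):
    """Map a query head to its shared KV head under grouped query attention:
    each group of n_q_heads // n_kv_heads consecutive query heads shares one."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def repeat_kv(kv_heads_data, n_q_heads):
    """Expand per-KV-head data so every query head sees its group's K/V."""
    group_size = n_q_heads // len(kv_heads_data)
    return [h for h in kv_heads_data for _ in range(group_size)]
```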
XSA
XSA used across all layers
parameters: {"layers":11}
VE128
Value residual enhancement in later layers
parameters: {"layers":[9,10]}
MLP3x
Three-times MLP stack
parameters: null
RoPE
Partial rotary positional embeddings
parameters: {"dimensions":"16/64"}
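Partial RoPE at 16/64 rotates only the first 16 dimensions of each 64-dim head and passes the rest through unrotated. A sketch, assuming the standard RoPE frequency schedule (the base of 10000 is an assumption):

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dimensions
    (16 of 64 per the PR); the remaining dimensions are left unchanged."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)  # standard RoPE frequencies
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```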
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
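A sketch of the weight averaging, using the PR's decay and interval. How the EMA and the tight SWA are combined at eval time is not stated, so this keeps them as two separate averaged copies:

```python
class EmaSwaAverager:
    """Maintains an exponential moving average (decay 0.997 per the PR) and a
    stochastic weight average built from snapshots every `swa_interval` steps."""
    def __init__(self, params, decay=0.997, swa_interval=50):
        self.decay = decay
        self.interval = swa_interval
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0
        self.step = 0

    def update(self, params):
        self.step += 1
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if self.step % self.interval == 0:  # "tight" snapshot cadence
            self.swa_sum = [s + p for s, p in zip(self.swa_sum, params)]
            self.swa_count += 1

    def swa(self):
        return [s / self.swa_count for s in self.swa_sum]
```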
Quantization
GPTQ-lite
bits: 6
scope: base model
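For 6-bit quantization of the base model, a minimal round-to-grid sketch (symmetric integer levels -31..31, per-tensor scale). GPTQ proper additionally does error-compensated, column-by-column rounding against the Hessian; that step is omitted here:

```python
def quantize_6bit(weights):
    """Symmetric 6-bit quantization: map floats to integer levels -31..31
    with one per-tensor scale. Round-to-grid only; no GPTQ error feedback."""
    amax = max(abs(w) for w in weights)
    scale = amax / 31 if amax > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale
```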
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
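With stride 64, a sliding-window evaluation scores only the last `stride` tokens of each window, so every scored token gets close to a full window of left context. A sketch of the index bookkeeping (the exact window policy is an assumption; the PR gives only the stride):

```python
def sliding_windows(n_tokens, window=32000, stride=64):
    """Yield (start, end, score_from) triples: the model reads tokens
    [start:end] but only [score_from:end] contribute to BPB, so scored
    spans tile the sequence exactly once."""
    out = []
    end = min(window, n_tokens)
    out.append((0, end, 0))  # first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        start = max(0, new_end - window)
        out.append((start, new_end, end))
        end = new_end
    return out
```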
Test-Time Training
score-first TTT
parameters: {"epochs":1,"learning_rate":0.001}
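The score-first discipline means each eval chunk's loss is computed with the current weights before any test-time update on that chunk, so a chunk never benefits from training on its own tokens. A sketch with hypothetical `score_fn`/`update_fn` hooks:

```python
def score_first_ttt(chunks, score_fn, update_fn):
    """Score-first test-time training: score each chunk FIRST, then run one
    TTT update on it (one epoch at lr 1e-3 in the PR), so only *later*
    chunks see the adapted weights. Both hooks are hypothetical."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))  # scored before training on this chunk
        update_fn(chunk)                # adapt for subsequent chunks only
    return losses
```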
Sequence Length
sequence_length
train_length: null
eval_length: 32000
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam":true}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
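The LN scale rule initializes layer k's LayerNorm gain at 1/sqrt(k+1), damping deeper layers' residual contributions. A one-liner sketch (0-indexed layers assumed):

```python
import math

def ln_scale_init(n_layers=12):
    """Depth-dependent LayerNorm scale: layer k (0-indexed) starts at
    1/sqrt(k+1), per the PR's "scale" parameter."""
    return [1.0 / math.sqrt(k + 1) for k in range(n_layers)]
```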

Novel Contributions

  • Order-16 frozen n-gram oracle prefilled from all 8B training tokens
  • Score-first TTT where each eval chunk is fully scored before any updates
  • BackoffNgramMixer with GPU-native order-2 through order-16 hashing
  • Complementary training that downweights tokens already well predicted by the oracle
  • Order-16 scaling chosen as the best BPB/eval-time tradeoff under budget
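The complementary-training idea above can be sketched as a per-token loss weight that shrinks when the frozen oracle already assigns high probability to the correct token. The linear form and the floor value are illustrative; the PR states only the downweighting principle:

```python
def complementary_weights(oracle_probs, floor=0.1):
    """Loss weights for complementary training: weight = max(floor,
    1 - p_oracle(correct token)), so the model's capacity is spent on
    tokens the n-gram oracle gets wrong. Floor and form are illustrative."""
    return [max(floor, 1.0 - p) for p in oracle_probs]
```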