PR #931 (open)

Record: 0.0498 bpb - Packed Training N-gram Artifact + Learned Weighting Gate

by AnirudhRahul
val_bpb: 0.0498
Architecture: Transformer
Optimizer:
Artifact Size: 15,857,871 bytes

Training Techniques

Architecture
BigramHash
Removed the bigram hash path to make room for the packed training n-gram cache while retaining warm low-order n-gram signal through the artifact.
parameters: {"vocab_size":0}
SmearGate
Learned multi-expert weighting gate over the neural model and n-gram experts for orders 2 through 9.
parameters: {"experts":"neural + n-gram order 2..9"}
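The gate described above can be sketched as a softmax-weighted mixture over the experts' next-token distributions. This is a minimal illustration, not the submission's implementation: the function name and fixed gate logits are hypothetical, and in the actual model the gate weights would presumably be learned and context-dependent.

```python
import math

def gated_mixture(expert_probs, gate_logits):
    """Mix per-expert next-token distributions with softmax gate weights.

    expert_probs: list of rows, one probability distribution per expert
    gate_logits:  one learned score per expert (fixed here for illustration)
    """
    # Softmax over the gate logits (subtract max for numerical stability).
    m = max(gate_logits)
    w = [math.exp(g - m) for g in gate_logits]
    z = sum(w)
    w = [x / z for x in w]
    # Weighted sum of the experts' distributions -> one mixture distribution.
    vocab = len(expert_probs[0])
    return [sum(w[e] * expert_probs[e][t] for e in range(len(w)))
            for t in range(vocab)]

# Toy usage: a confident "neural" expert and a flat order-2 n-gram expert.
mix = gated_mixture(
    [[0.7, 0.1, 0.1, 0.1],          # neural model
     [0.25, 0.25, 0.25, 0.25]],     # order-2 n-gram expert
    [3.0, 0.0],                     # gate strongly favors the neural expert
)
```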
Other
Packed a 32K-bucket order-2..9 training n-gram cache into the artifact as 32-bit count tables so evaluation starts with a pre-warmed cache.
parameters: {"buckets":32768,"orders":"2..9","count_table_bits":32}
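A plausible shape for this packed artifact, given the stated parameters (32,768 buckets, orders 2..9, 32-bit counts): hash each n-gram into a bucket, accumulate uint32 counts per order, and serialize the tables back-to-back. The bucket hash and byte layout below are assumptions for illustration; at 8 orders x 32,768 buckets x 4 bytes the tables come to exactly 1 MiB, which fits comfortably inside the 15,857,871-byte artifact.

```python
import struct

BUCKETS = 32768
ORDERS = range(2, 10)  # orders 2..9, per the PR

def bucket(ngram, order):
    # Hypothetical FNV-1a-style fold of the order and token ids into a bucket.
    h = 2166136261
    for t in (order, *ngram):
        h = ((h ^ t) * 16777619) & 0xFFFFFFFF
    return h % BUCKETS

def build_tables(tokens):
    # One uint32 count table per order, every bucket starting at zero.
    tables = {o: [0] * BUCKETS for o in ORDERS}
    for o in ORDERS:
        for i in range(len(tokens) - o + 1):
            tables[o][bucket(tuple(tokens[i:i + o]), o)] += 1
    return tables

def serialize(tables):
    # Pack counts as little-endian uint32: 8 * 32768 * 4 bytes = 1 MiB total.
    return b"".join(
        struct.pack("<%dI" % BUCKETS, *tables[o]) for o in sorted(tables)
    )
```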
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":131072,"temperature":0.85,"freeze_blocks":2,"epochs":2,"learning_rate":0.0001}
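The "score-first" ordering matters for causality: each chunk is scored with the current weights before the model adapts on it, so no token is ever evaluated by a model that has already trained on it. The skeleton below shows only that control flow; `score_fn` and `train_step` are hypothetical placeholders, and the temperature and block-freezing from the parameters would live inside the real training step rather than in this loop.

```python
# Parameters from the PR; TEMPERATURE and FREEZE_BLOCKS would be applied
# inside the actual loss / optimizer setup, not in this outer loop.
CHUNK_TOKENS = 131072
TEMPERATURE = 0.85
FREEZE_BLOCKS = 2
EPOCHS = 2
LR = 1e-4

def score_first_ttt(chunks, score_fn, train_step):
    """Score each chunk with the current weights, then adapt on it."""
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        bits = score_fn(chunk)        # 1) score BEFORE any update (causal)
        total_bits += bits
        total_tokens += len(chunk)
        for _ in range(EPOCHS):       # 2) then do TTT on the same chunk
            train_step(chunk, lr=LR)
    return total_bits / total_tokens  # average bits per token
```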
Evaluation
stride-based eval
parameters: {"stride":64}
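Stride-based evaluation in the usual sense: slide a full-context window forward by `stride` tokens at a time and score only the tokens not covered by the previous window, so every token is evaluated exactly once with plenty of left context. A sketch of the window bookkeeping, with a hypothetical function name and an assumed context length:

```python
def stride_eval_windows(n_tokens, max_len, stride=64):
    """Yield (begin, end, n_scored) windows for strided evaluation.

    Each window spans up to `max_len` tokens; only the tokens past the
    previous window's end are scored, so coverage is exact and disjoint.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        n_scored = end - prev_end   # only the newly exposed tokens
        yield begin, end, n_scored
        prev_end = end
        if end == n_tokens:
            break
```

With stride 64 and a long context window, almost every scored token sees near-full left context, at the cost of re-running the model over heavily overlapping windows.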
Regularization
magnitude pruning
parameters: {"prune_pct":0.05}
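Magnitude pruning at 5% means zeroing the 5% of weights with the smallest absolute value. A minimal flat-list sketch (real implementations typically prune tensors per-layer or globally with masks; the function name here is hypothetical):

```python
def magnitude_prune(weights, prune_pct=0.05):
    """Zero out the smallest-|w| fraction of a flat weight list."""
    k = int(len(weights) * prune_pct)
    if k == 0:
        return list(weights)
    # Threshold at the k-th smallest magnitude.
    thresh = sorted(abs(w) for w in weights)[k - 1]
    pruned, out = 0, []
    for w in weights:
        if abs(w) <= thresh and pruned < k:
            out.append(0.0)           # prune: smallest-magnitude weight
            pruned += 1
        else:
            out.append(w)             # keep: survivors are unchanged
    return out
```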

Novel Contributions

  • Learned weighting gate over neural and order-2..9 n-gram experts
  • Packed 32K-bucket training n-gram artifact serialized into the submission
  • Single-pass causal evaluation with pre-warmed cache and online updates
  • Removed bigram hash path to fit the packed cache under the 16MB artifact limit
  • Simplified evaluation by removing maturity decay and heuristic switching
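The third bullet's combination of a pre-warmed cache with online updates can be illustrated with a toy bigram counter: each token is scored from the counts accumulated so far (seeded from the artifact, or empty), and only then are the counts updated, keeping the pass strictly causal. Everything here is a simplified stand-in; the real system hashes orders 2..9 into buckets and mixes with the neural model through the gate.

```python
import math
from collections import defaultdict

def online_bigram_eval(tokens, vocab=256, alpha=1.0, warm=None):
    """Single causal pass: score each token from counts so far, then update.

    `warm` optionally pre-seeds the counts, standing in for the packed
    training n-gram artifact; `alpha` is Laplace smoothing.
    """
    pair = defaultdict(int)   # (prev, tok) -> count
    ctx = defaultdict(int)    # prev -> total count
    if warm:
        for (p, t), c in warm.items():
            pair[(p, t)] += c
            ctx[p] += c
    bits = 0.0
    for i in range(1, len(tokens)):
        p, t = tokens[i - 1], tokens[i]
        # 1) Score first, using only counts observed (or pre-warmed) so far.
        prob = (pair[(p, t)] + alpha) / (ctx[p] + alpha * vocab)
        bits += -math.log2(prob)
        # 2) Then update the cache online.
        pair[(p, t)] += 1
        ctx[p] += 1
    return bits / (len(tokens) - 1)   # average bits per token
```

On a predictable stream, a pre-warmed cache scores strictly better than a cold one, which is the point of shipping the counts inside the artifact.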