val_bpb: 0.3779
Architecture: Transformer
Optimizer: —
Artifact Size: 16MB
Training Techniques

- Evaluation: sliding window eval
  - parameters: {"stride": 64}
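A sliding window eval with a short stride scores each token exactly once while guaranteeing it at least `ctx_len - stride` tokens of left context. A minimal sketch of that scheme; the `logprob_fn` callback, the context length, and the return convention are assumptions, not the submission's actual harness:

```python
import math

def sliding_window_nll(tokens, logprob_fn, ctx_len=128, stride=64):
    """Average negative log-likelihood under sliding-window evaluation.

    Windows start every `stride` tokens and span up to `ctx_len` tokens.
    Each window scores only the tokens no earlier window has scored, so
    every token is scored exactly once and (past the first window) sees
    at least ctx_len - stride tokens of context.
    """
    total_nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + ctx_len, len(tokens))
        for i in range(prev_end, end):          # only not-yet-scored tokens
            context = tokens[begin:i]           # context confined to this window
            total_nll += -logprob_fn(context, tokens[i])
            scored += 1
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / scored
```

Dividing the total NLL by `math.log(2)` times the byte count of the text, rather than by the token count, would give bits per byte (val_bpb).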
- Architecture: BigramHash, a hash-table-based n-gram cache/mixer built only from already-scored tokens during evaluation
  - parameters: {"orders": "2-9", "buckets": 4000000}
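As described, BigramHash stays strictly causal: n-grams are inserted into the hash table only after their final token has been scored, and lookups at prediction time therefore never see the current token. A minimal sketch under stated assumptions: the class and method names are hypothetical, a dict-of-dicts stands in for the flat bucket arrays a real implementation would use, and the bucket count is a toy value rather than 4,000,000:

```python
from collections import defaultdict

class BigramHashCache:
    """Hash-bucketed n-gram counts built only from already-scored tokens.

    After each token is scored, every n-gram (orders min_order..max_order)
    ending at that token is hashed into a fixed table of `buckets` cells.
    At prediction time the cache returns a count-based probability for the
    candidate token under the longest matching context, which a mixer can
    blend with the base model.  Hash collisions are tolerated by design.
    """
    def __init__(self, min_order=2, max_order=9, buckets=4_000_000, min_count=2):
        self.min_order, self.max_order = min_order, max_order
        self.buckets, self.min_count = buckets, min_count
        self.table = defaultdict(lambda: defaultdict(int))  # bucket -> {token: count}

    def _bucket(self, key):
        return hash(key) % self.buckets

    def update(self, history):
        """Call only after history[-1] has been scored (strict causality)."""
        for order in range(self.min_order, self.max_order + 1):
            if len(history) < order:
                break
            context = tuple(history[-order:-1])             # order-1 context tokens
            self.table[self._bucket((order,) + context)][history[-1]] += 1

    def predict(self, history, token):
        """Count-based probability of `token` under the longest known context."""
        for order in range(self.max_order, self.min_order - 1, -1):
            if len(history) < order - 1:
                continue
            context = tuple(history[-(order - 1):])
            counts = self.table.get(self._bucket((order,) + context))
            if counts and sum(counts.values()) >= self.min_count:
                return counts.get(token, 0) / sum(counts.values())
        return None                                         # back off to base model
```

With this layout, shrinking `buckets` folds rare contexts together, which is how collisions can push pooled counts over the `min_count` threshold, matching the observation in the contributions list.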
- Test-Time Training: score-first TTT
  - parameters: null
- Test-Time Training: LoRA TTT
  - parameters: {"epochs": 8}
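LoRA TTT adapts the model on the eval text itself by training only a low-rank update to each frozen weight matrix, here for 8 epochs. A sketch of the LoRA forward pass using plain nested lists; the shapes, scaling convention, and function name are illustrative assumptions, not the submission's code:

```python
def lora_linear(x, W, A, B, alpha=16):
    """y = x @ (W + (alpha / r) * A @ B): frozen weight plus low-rank update.

    W is d_in x d_out and stays frozen; only A (d_in x r) and B (r x d_out)
    are trained at test time, so the adapted-parameter count stays tiny.
    Matrices are lists of rows; rank r is the number of rows of B.
    """
    r = len(B)
    d_in, d_out = len(W), len(W[0])
    y = [0.0] * d_out
    for i in range(d_in):                   # x @ W (frozen path)
        for j in range(d_out):
            y[j] += x[i] * W[i][j]
    scale = alpha / r
    # low-rank path through the rank-r bottleneck: (x @ A) @ B
    xa = [sum(x[i] * A[i][k] for i in range(d_in)) for k in range(r)]
    for k in range(r):
        for j in range(d_out):
            y[j] += scale * xa[k] * B[k][j]
    return y
```

Because the frozen path dominates the compute and only `A` and `B` receive gradients, eight epochs over a short eval document remain cheap relative to full fine-tuning.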
- Quantization: int6
  - bits: 6
  - scope: all
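int6 with scope "all" maps every weight to a 6-bit integer in [-32, 31]. A sketch of symmetric round-to-nearest int6 quantization; the submission's exact scheme (scale granularity, packing layout) isn't specified, and a real int6 artifact would also bit-pack four values into three bytes, which is what makes the 16MB size plausible:

```python
def quantize_int6(weights):
    """Symmetric 6-bit quantization: q = clamp(round(w / scale), -32, 31).

    The scale maps the largest-magnitude weight to +/-31 so the full
    int6 range is used.  Sketch only: values are returned as Python ints
    here rather than bit-packed four-per-three-bytes.
    """
    peak = max(abs(w) for w in weights) or 1.0   # guard against all-zero input
    scale = peak / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate weights; error is bounded by scale / 2 in-range."""
    return [v * scale for v in q]
```

The round trip loses at most half a quantization step per in-range weight, which is the usual accuracy/size trade-off a 6-bit scope-all scheme accepts.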
Novel Contributions
- Eval-time n-gram caching/mixing that preserves strict causality by using only already-scored tokens
- Demonstration that a pure n-gram cache can outperform the neural base model on FineWeb validation
- Finding that smaller hash tables with more collisions can improve BPB because collisions help counts cross the min_count threshold
- Global all-reduce synchronization of n-gram hash table deltas across GPUs to avoid cache fragmentation
- Proposal to cap eval-time memory or per-token latency as a competition rule clarification
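The all-reduce contribution above amounts to summing per-GPU count deltas and applying the same sum on every rank, so all hash-table replicas stay identical. A single-process stand-in for that collective; the real implementation presumably flattens deltas into a tensor for `torch.distributed.all_reduce`, and the function name here is hypothetical:

```python
def allreduce_ngram_deltas(per_rank_deltas):
    """Sum per-rank n-gram count deltas, as an all-reduce would.

    Each rank accumulates {bucket: count_delta} since the last sync.
    Applying the summed deltas everywhere keeps every replica's hash
    table identical, preventing cache fragmentation across
    data-parallel shards.  (Single-process simulation of the GPU
    collective.)
    """
    merged = {}
    for delta in per_rank_deltas:
        for bucket, count in delta.items():
            merged[bucket] = merged.get(bucket, 0) + count
    return merged
```

Each rank would then add `merged` into its local table and clear its own delta buffer before scoring resumes.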