PR #809

open

Record: Chunk-Based N-gram Backoff + Score-First TTT (0.295 BPB)

by AayushBaniya2006
val_bpb
0.2952
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.4MB

Training Techniques

Quantization
GPTQ
bits: 5
scope: all weights / exported model
Architecture
XSA
Exclusive self-attention on the last 4 layers
parameters: {"layers":4}
SmearGate
Learned per-dimension gate blending current and previous token embeddings
parameters: null
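A minimal sketch of the smear gate described above, assuming a sigmoid-squashed per-dimension parameter that convexly blends each token's embedding with its predecessor's (the record does not give the exact blend form):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token embedding with the previous token's embedding
    using a learned per-dimension gate (sketch; blend form is assumed).

    x: (seq_len, dim) token embeddings
    gate_logits: (dim,) learned parameters, squashed to (0, 1) via sigmoid
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # per-dimension gate in (0, 1)
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return (1.0 - g) * x + g * prev         # convex blend per dimension

x = np.random.randn(8, 16)
y = smear_gate(x, np.zeros(16))  # zero logits -> gate 0.5 -> equal blend
```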
BigramHash
Hash-based bigram feature module with bucketed representation
parameters: {"buckets":4096}
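A sketch of the bucketed bigram lookup: each (prev, cur) token pair hashes into one of 4096 buckets of an embedding table. The mixing constant and the padding id for the first position are assumptions; the record only specifies the bucket count.

```python
import numpy as np

def bigram_hash_features(token_ids, table, buckets=4096):
    """Look up a bucketed feature vector for each (prev, cur) token pair.

    The hash constant 1000003 and the start-of-sequence padding id 0 are
    assumptions for illustration.
    """
    feats = np.zeros((len(token_ids), table.shape[1]))
    prev = 0  # assumed start-of-sequence padding id
    for i, cur in enumerate(token_ids):
        feats[i] = table[(prev * 1000003 + cur) % buckets]
        prev = cur
    return feats

table = np.random.randn(4096, 32)
feats = bigram_hash_features([5, 9, 5, 9], table)
```

Repeated bigrams hash to the same bucket, so their features coincide.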
MLP3x
Expanded MLP hidden size to 3.0x the model dimension
parameters: {"multiplier":3}
tied embeddings
Input and output embeddings are tied
parameters: null
KV head count
Grouped-query attention with 8 query heads and 4 KV heads
parameters: {"query_heads":8,"kv_heads":4}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dims":"16/64"}
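A sketch of partial RoPE with the 16/64 split from the record: only the first 16 of 64 head dimensions are rotated, the rest pass through. The first-half/second-half pairing convention (vs. interleaved pairs) is an assumption:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head dimension, leaving the remainder untouched.

    x: (seq_len, head_dim)
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]  # paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.random.randn(4, 64)
q_rot = partial_rope(q)
```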
LeakyReLU(0.9)^2
Leaky ReLU with negative slope 0.9 followed by squaring
parameters: {"negative_slope":0.9}
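The activation as the record describes it, i.e. a leaky variant of the squared-ReLU activation:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.9):
    """Leaky ReLU with negative slope 0.9, followed by elementwise
    squaring, per the record's description."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```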
Initialization
OrthoInit
Orthogonal initialization for all 2D weight matrices
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ns_steps":5,"banking":true}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.035,"scope":"embeddings"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025,"scope":"scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997,"step_aware_warmup":true}
Polyak averaging
parameters: {"decay":0.998}
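A sketch of the EMA entry's step-aware warmup: early in training the effective decay is lowered so the average can catch up to the fast-moving weights. The specific warmup schedule min(decay, (1+step)/(10+step)) is an assumption; the record only states decay 0.997 with warmup:

```python
def ema_update(avg, params, step, decay=0.997):
    """One EMA step with step-aware warmup (warmup schedule is an
    assumed, commonly used form)."""
    d = min(decay, (1 + step) / (10 + step))
    return [d * a + (1 - d) * p for a, p in zip(avg, params)]

early = ema_update([0.0], [1.0], step=0)      # warmup: effective decay 0.1
late = ema_update([0.0], [1.0], step=10**6)   # converged: decay 0.997
```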
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
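A sketch of the sliding-window evaluation: windows of `context_length` tokens advance by `stride`, and only tokens not covered by a previous window are counted, so each token is scored exactly once with long left context. The `nll_fn(window, n_new)` interface (summed NLL over the window's last `n_new` tokens) is an assumption:

```python
def sliding_window_nll(tokens, nll_fn, context_length=2048, stride=64):
    """Average per-token NLL under fixed-stride sliding-window scoring."""
    total, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        total += nll_fn(tokens[begin:end], end - prev_end)  # count new tokens only
        prev_end = end
        if end == len(tokens):
            break
    return total / len(tokens)

# Toy model charging 1 nat per newly counted token:
avg = sliding_window_nll(list(range(10)), lambda w, n: float(n),
                         context_length=4, stride=2)
```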
chunk-based sequential evaluation
parameters: {"chunk_tokens":1000000}
Test-Time Training
score-first TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":2048,"epochs":3}
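The key invariant of score-first TTT is ordering: every chunk is scored with the current adapter before any update on it. A control-flow sketch with the LoRA mechanics abstracted behind `adapt_fn` (the toy mutable state below is only there to show the ordering):

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs=3):
    """Score-first TTT loop (sketch): each chunk is scored BEFORE any
    gradient step on it, so no token's loss benefits from having already
    been trained on (no hindsight selection)."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += score_fn(chunk)   # evaluate first...
        total_tokens += len(chunk)
        for _ in range(epochs):        # ...then adapt (e.g. LoRA rank 8)
            adapt_fn(chunk)
    return total_nll / total_tokens

# Toy stand-in: "adaptation" halves the per-token loss for later chunks.
state = {"loss": 1.0}
nll = score_first_ttt([[0] * 4, [0] * 4],
                      score_fn=lambda c: state["loss"] * len(c),
                      adapt_fn=lambda c: state.update(loss=state["loss"] * 0.5),
                      epochs=1)
```

The first chunk is scored at full loss; only subsequent chunks see the adapted weights.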
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Entropy-adaptive N-gram interpolation with per-order multipliers and score-first chunk-synchronized cache updates
parameters: {"order":9,"alpha_min":0.05,"alpha_max":0.6,"min_count":2,"num_buckets":4194304,"chunk_tokens":1000000}
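A sketch of the entropy-adaptive interpolation: the N-gram weight alpha grows with the model's entropy, so the cache contributes most when the model is uncertain. The alpha bounds come from the record; the linear mapping from normalized entropy to alpha is an assumption, and the per-order multipliers are omitted for brevity:

```python
import numpy as np

def entropy_adaptive_mix(p_model, p_ngram, alpha_min=0.05, alpha_max=0.6):
    """Blend model and N-gram next-token distributions with an
    entropy-dependent weight (mapping form is an assumption)."""
    h = -np.sum(p_model * np.log(p_model + 1e-12))
    h_norm = h / np.log(len(p_model))     # ~0 when confident, ~1 when uniform
    alpha = alpha_min + (alpha_max - alpha_min) * h_norm
    return (1.0 - alpha) * p_model + alpha * p_ngram

# Uniform (maximally uncertain) model vs. a confident N-gram hit:
mixed = entropy_adaptive_mix(np.full(4, 0.25), np.array([1.0, 0.0, 0.0, 0.0]))
```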

Novel Contributions

  • Chunk-based order-9 N-gram backoff cache built incrementally from already-scored validation tokens
  • Score-first multi-GPU cache synchronization with all ranks updating after each chunk
  • Entropy-adaptive interpolation between model probabilities and N-gram probabilities
  • Per-order alpha multipliers that boost high-order matches and suppress low-order matches
  • Score-first TTT with LoRA rank 8 and hard enforcement of no hindsight selection
  • GPTQ int5 export to fit the artifact budget
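The first two contributions can be sketched as a single-process cache (the multi-GPU chunk synchronization is omitted): counts are updated only from already-scored tokens, and lookup backs off from the highest order whose context has enough evidence. The hashing scheme and update details are assumptions; order, min_count, and bucket count come from the record.

```python
from collections import defaultdict

class NgramBackoffCache:
    """Bucketed order-9 N-gram backoff cache (sketch)."""

    def __init__(self, order=9, min_count=2, num_buckets=4194304):
        self.order, self.min_count, self.num_buckets = order, min_count, num_buckets
        self.ctx_counts = defaultdict(int)   # context bucket -> total count
        self.pair_counts = defaultdict(int)  # (context bucket, next) -> count

    def _bucket(self, ctx):
        return hash(ctx) % self.num_buckets  # bucketed context key

    def update(self, tokens):
        """Fold already-scored tokens into the counts, all orders at once."""
        for i in range(1, len(tokens)):
            for n in range(1, min(self.order, i) + 1):
                ctx = self._bucket(tuple(tokens[i - n:i]))
                self.ctx_counts[ctx] += 1
                self.pair_counts[(ctx, tokens[i])] += 1

    def prob(self, context, next_token):
        """Back off from the highest order with >= min_count evidence."""
        for n in range(min(self.order, len(context)), 0, -1):
            ctx = self._bucket(tuple(context[-n:]))
            if self.ctx_counts[ctx] >= self.min_count:
                return n, self.pair_counts[(ctx, next_token)] / self.ctx_counts[ctx]
        return 0, None  # no match at any order: fall back to model-only

cache = NgramBackoffCache(order=3, min_count=2)
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 3])
```

On a repetitive stream the cache returns confident high-order matches, and unseen contexts back off to lower orders.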