PR #809

open

Record: Chunk-Based N-gram Backoff + Score-First TTT (0.295 BPB)

by AayushBaniya2006
val_bpb
0.2952
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.4MB

Training Techniques

Quantization
GPTQ
bits: 5
scope: all weights / exported model
Architecture
XSA
Exclusive self-attention on the last 4 layers
parameters: {"layers":4}
SmearGate
Learned per-dimension gate blending current and previous token embeddings
parameters: null
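A minimal sketch of the smear gate described above, assuming a sigmoid-squashed per-dimension parameter that convexly blends each token's embedding with its predecessor's (the record does not give the exact blend form):

```python
import numpy as np

def smear_gate(x, gate_logits):
    """Blend each token embedding with the previous token's embedding
    using a learned per-dimension gate (sketch; blend form is assumed).

    x: (seq_len, dim) token embeddings
    gate_logits: (dim,) learned parameters, squashed to (0, 1) via sigmoid
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # per-dimension gate in (0, 1)
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return (1.0 - g) * x + g * prev         # convex blend per dimension

x = np.random.randn(8, 16)
y = smear_gate(x, np.zeros(16))  # zero logits -> gate 0.5 -> equal blend
```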
BigramHash
Hash-based bigram feature module with bucketed representation
parameters: {"buckets":4096}
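A sketch of the bucketed bigram lookup: each (prev, cur) token pair hashes into one of 4096 buckets of an embedding table. The mixing constant and the padding id for the first position are assumptions; the record only specifies the bucket count.

```python
import numpy as np

def bigram_hash_features(token_ids, table, buckets=4096):
    """Look up a bucketed feature vector for each (prev, cur) token pair.

    The hash constant 1000003 and the start-of-sequence padding id 0 are
    assumptions for illustration.
    """
    feats = np.zeros((len(token_ids), table.shape[1]))
    prev = 0  # assumed start-of-sequence padding id
    for i, cur in enumerate(token_ids):
        feats[i] = table[(prev * 1000003 + cur) % buckets]
        prev = cur
    return feats

table = np.random.randn(4096, 32)
feats = bigram_hash_features([5, 9, 5, 9], table)
```

Repeated bigrams hash to the same bucket, so their features coincide.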
MLP3x
Expanded MLP hidden size to 3.0x the model dimension
parameters: {"multiplier":3}
tied embeddings
Input and output embeddings are tied
parameters: null
KV head count
Grouped-query attention with 8 query heads and 4 KV heads
parameters: {"query_heads":8,"kv_heads":4}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dims":"16/64"}
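A sketch of partial RoPE with the 16/64 split from the record: only the first 16 of 64 head dimensions are rotated, the rest pass through. The first-half/second-half pairing convention (vs. interleaved pairs) is an assumption:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of each
    head dimension, leaving the remainder untouched.

    x: (seq_len, head_dim)
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]  # paired rotated dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

q = np.random.randn(4, 64)
q_rot = partial_rope(q)
```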
LeakyReLU(0.9)^2
Leaky ReLU with negative slope 0.9 followed by squaring
parameters: {"negative_slope":0.9}
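The activation as the record describes it, i.e. a leaky variant of the squared-ReLU activation:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.9):
    """Leaky ReLU with negative slope 0.9, followed by elementwise
    squaring, per the record's description."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```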
Initialization
OrthoInit
Orthogonal initialization for all 2D weight matrices
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ns_steps":5,"banking":true}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.035,"scope":"embeddings"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025,"scope":"scalars"}
Weight Averaging
EMA
parameters: {"decay":0.997,"step_aware_warmup":true}
Polyak averaging
parameters: {"decay":0.998}
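A sketch of the EMA entry's step-aware warmup: early in training the effective decay is lowered so the average can catch up to the fast-moving weights. The specific warmup schedule min(decay, (1+step)/(10+step)) is an assumption; the record only states decay 0.997 with warmup:

```python
def ema_update(avg, params, step, decay=0.997):
    """One EMA step with step-aware warmup (warmup schedule is an
    assumed, commonly used form)."""
    d = min(decay, (1 + step) / (10 + step))
    return [d * a + (1 - d) * p for a, p in zip(avg, params)]

early = ema_update([0.0], [1.0], step=0)      # warmup: effective decay 0.1
late = ema_update([0.0], [1.0], step=10**6)   # converged: decay 0.997
```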
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
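A sketch of the sliding-window evaluation: windows of `context_length` tokens advance by `stride`, and only tokens not covered by a previous window are counted, so each token is scored exactly once with long left context. The `nll_fn(window, n_new)` interface (summed NLL over the window's last `n_new` tokens) is an assumption:

```python
def sliding_window_nll(tokens, nll_fn, context_length=2048, stride=64):
    """Average per-token NLL under fixed-stride sliding-window scoring."""
    total, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context_length, len(tokens))
        total += nll_fn(tokens[begin:end], end - prev_end)  # count new tokens only
        prev_end = end
        if end == len(tokens):
            break
    return total / len(tokens)

# Toy model charging 1 nat per newly counted token:
avg = sliding_window_nll(list(range(10)), lambda w, n: float(n),
                         context_length=4, stride=2)
```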
chunk-based sequential evaluation
parameters: {"chunk_tokens":1000000}
Test-Time Training
score-first TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":2048,"epochs":3}
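The key invariant of score-first TTT is ordering: every chunk is scored with the current adapter before any update on it. A control-flow sketch with the LoRA mechanics abstracted behind `adapt_fn` (the toy mutable state below is only there to show the ordering):

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs=3):
    """Score-first TTT loop (sketch): each chunk is scored BEFORE any
    gradient step on it, so no token's loss benefits from having already
    been trained on (no hindsight selection)."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += score_fn(chunk)   # evaluate first...
        total_tokens += len(chunk)
        for _ in range(epochs):        # ...then adapt (e.g. LoRA rank 8)
            adapt_fn(chunk)
    return total_nll / total_tokens

# Toy stand-in: "adaptation" halves the per-token loss for later chunks.
state = {"loss": 1.0}
nll = score_first_ttt([[0] * 4, [0] * 4],
                      score_fn=lambda c: state["loss"] * len(c),
                      adapt_fn=lambda c: state.update(loss=state["loss"] * 0.5),
                      epochs=1)
```

The first chunk is scored at full loss; only subsequent chunks see the adapted weights.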
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"value":0.04}
Other
other
Entropy-adaptive N-gram interpolation with per-order multipliers and score-first chunk-synchronized cache updates
parameters: {"order":9,"alpha_min":0.05,"alpha_max":0.6,"min_count":2,"num_buckets":4194304,"chunk_tokens":1000000}
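A sketch of the entropy-adaptive interpolation: the N-gram weight alpha grows with the model's entropy, so the cache contributes most when the model is uncertain. The alpha bounds come from the record; the linear mapping from normalized entropy to alpha is an assumption, and the per-order multipliers are omitted for brevity:

```python
import numpy as np

def entropy_adaptive_mix(p_model, p_ngram, alpha_min=0.05, alpha_max=0.6):
    """Blend model and N-gram next-token distributions with an
    entropy-dependent weight (mapping form is an assumption)."""
    h = -np.sum(p_model * np.log(p_model + 1e-12))
    h_norm = h / np.log(len(p_model))     # ~0 when confident, ~1 when uniform
    alpha = alpha_min + (alpha_max - alpha_min) * h_norm
    return (1.0 - alpha) * p_model + alpha * p_ngram

# Uniform (maximally uncertain) model vs. a confident N-gram hit:
mixed = entropy_adaptive_mix(np.full(4, 0.25), np.array([1.0, 0.0, 0.0, 0.0]))
```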

Novel Contributions

  • Chunk-based order-9 N-gram backoff cache built incrementally from already-scored validation tokens
  • Score-first multi-GPU cache synchronization with all ranks updating after each chunk
  • Entropy-adaptive interpolation between model probabilities and N-gram probabilities
  • Per-order alpha multipliers that boost high-order matches and suppress low-order matches
  • Score-first TTT with LoRA rank 8 and hard enforcement of no hindsight selection
  • GPTQ int5 export to fit the artifact budget
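The first two contributions can be sketched as a single-process cache (the multi-GPU chunk synchronization is omitted): counts are updated only from already-scored tokens, and lookup backs off from the highest order whose context has enough evidence. The hashing scheme and update details are assumptions; order, min_count, and bucket count come from the record.

```python
from collections import defaultdict

class NgramBackoffCache:
    """Bucketed order-9 N-gram backoff cache (sketch)."""

    def __init__(self, order=9, min_count=2, num_buckets=4194304):
        self.order, self.min_count, self.num_buckets = order, min_count, num_buckets
        self.ctx_counts = defaultdict(int)   # context bucket -> total count
        self.pair_counts = defaultdict(int)  # (context bucket, next) -> count

    def _bucket(self, ctx):
        return hash(ctx) % self.num_buckets  # bucketed context key

    def update(self, tokens):
        """Fold already-scored tokens into the counts, all orders at once."""
        for i in range(1, len(tokens)):
            for n in range(1, min(self.order, i) + 1):
                ctx = self._bucket(tuple(tokens[i - n:i]))
                self.ctx_counts[ctx] += 1
                self.pair_counts[(ctx, tokens[i])] += 1

    def prob(self, context, next_token):
        """Back off from the highest order with >= min_count evidence."""
        for n in range(min(self.order, len(context)), 0, -1):
            ctx = self._bucket(tuple(context[-n:]))
            if self.ctx_counts[ctx] >= self.min_count:
                return n, self.pair_counts[(ctx, next_token)] / self.ctx_counts[ctx]
        return 0, None  # no match at any order: fall back to model-only

cache = NgramBackoffCache(order=3, min_count=2)
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 3])
```

On a repetitive stream the cache returns confident high-order matches, and unseen contexts back off to lower orders.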