PR #846

open

Record: Two-Pass N-gram Rescoring (val_bpb 0.1434)

by himanshudongre

val_bpb

0.1434

Architecture

Transformer

Optimizer

Muon

Artifact Size

13.4 MB

Training Techniques

Quantization

GPTQ

bits: 5

scope: all
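
For orientation, a 5-bit signed integer grid has 32 levels, from -16 to 15. The sketch below shows plain round-to-nearest quantization onto that grid with a single per-tensor scale. It is not GPTQ itself, which additionally uses second-order information to compensate rounding error; the function names and the per-tensor scale are assumptions made for illustration only.

```python
from typing import List, Tuple

QMIN, QMAX = -16, 15  # the 32 levels of a signed 5-bit integer


def quantize_int5(weights: List[float]) -> Tuple[List[int], float]:
    """Round-to-nearest onto the 5-bit grid (hypothetical helper, not GPTQ)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this grid, the reconstruction error of any in-range weight is at most half the scale.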

Architecture

LeakyReLU(0.9)^2

Uses a LeakyReLU squared activation variant in the transformer.

parameters: {"slope":0.9}
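
One plausible reading of LeakyReLU(0.9)^2 is "apply LeakyReLU with negative slope 0.9, then square the result". The scalar sketch below assumes that reading. Whether the PR preserves the sign of negative inputs after squaring is not stated here, so treat that detail as an assumption.

```python
def leaky_relu_squared(x: float, slope: float = 0.9) -> float:
    """Square of LeakyReLU(x); assumes a plain square with no sign handling."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Under this reading the output is always non-negative, and a negative input of magnitude 1 maps to slope squared (0.81 for a slope of 0.9).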

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"embeddings_optimizer":"AdamW"}

Weight Averaging

EMA

parameters: {"decays":[0.995,0.996,0.997]}
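
A minimal sketch of keeping one exponential moving average per listed decay. How the PR uses the three averages (for example, selecting the best one at export time) is not stated, so the sketch only shows the update step. The function names and the flat list-of-floats parameter layout are hypothetical.

```python
from typing import Dict, List

DECAYS = (0.995, 0.996, 0.997)  # values taken from the parameters above


def ema_step(shadow: List[float], params: List[float], decay: float) -> List[float]:
    """One EMA update: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


def update_all(shadows: Dict[float, List[float]], params: List[float]) -> Dict[float, List[float]]:
    """Advance every shadow copy by one step, each with its own decay."""
    return {decay: ema_step(shadow, params, decay) for decay, shadow in shadows.items()}
```

A larger decay makes the average move more slowly toward the current weights.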

Evaluation

two-pass n-gram rescoring

parameters: {"rescore_chunks":15,"cold_cache_rescoring":true}

Test-Time Training

score-first TTT

parameters: {"optimizer":"AdamW","temperature":0.98,"chunk_size":2048}
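
"Score-first" is read here as: each chunk is scored with the current weights before the model adapts on it, so no token is scored by a model that has already trained on it. The loop below sketches that ordering. The toy unigram model is a hypothetical stand-in for the real network, and the AdamW update and the 0.98 temperature are omitted.

```python
import math
from typing import Callable, List, Sequence


def score_first_ttt(
    tokens: Sequence[int],
    chunk_size: int,
    score: Callable[[Sequence[int]], float],
    adapt: Callable[[Sequence[int]], None],
) -> float:
    """Return total loss in bits, scoring every chunk before adapting on it."""
    total = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total += score(chunk)  # the model has not yet seen this chunk
        adapt(chunk)           # only now update on the scored tokens
    return total


class ToyUnigram:
    """Hypothetical stand-in model: add-one smoothed unigram counts."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.counts: List[int] = [0] * vocab_size
        self.total = 0

    def score(self, chunk: Sequence[int]) -> float:
        denom = self.total + self.vocab_size
        return sum(-math.log2((self.counts[t] + 1) / denom) for t in chunk)

    def adapt(self, chunk: Sequence[int]) -> None:
        for t in chunk:
            self.counts[t] += 1
            self.total += 1
```

In the real pipeline the chunk size would be 2048 tokens, per the parameters above.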

Other

other

Entropy-adaptive order-2-to-9 n-gram backoff with 4M hash buckets.

parameters: {"order_range":"2-9","hash_buckets":4000000}
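
A rough sketch of a hashed backoff n-gram cache over orders 2 to 9. Counts live in a fixed number of hash buckets, and lookup starts at the longest context and backs off to shorter ones until a context with nonzero count is found. The hash function, the class and method names, and the `mix` rule are all assumptions. In particular, the source does not say what "entropy-adaptive" controls; this sketch assumes it sets the interpolation weight between the model and the n-gram estimate.

```python
from typing import Dict, Optional, Sequence, Tuple


class HashedNgramBackoff:
    """Hypothetical hashed count tables for orders min_order..max_order."""

    def __init__(self, min_order: int = 2, max_order: int = 9, buckets: int = 4_000_000) -> None:
        self.min_order, self.max_order, self.buckets = min_order, max_order, buckets
        self.ctx_counts: Dict[int, int] = {}   # bucket -> count of context
        self.full_counts: Dict[int, int] = {}  # bucket -> count of context + token

    def _bucket(self, order: int, tokens: Sequence[int]) -> int:
        h = 1469598103934665603 ^ order  # FNV-style mixing, seeded by order
        for t in tokens:
            h = ((h ^ t) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
        return h % self.buckets

    def update(self, history: Sequence[int], token: int) -> None:
        for order in range(self.min_order, self.max_order + 1):
            n_ctx = order - 1
            if len(history) < n_ctx:
                break  # longer orders need even more history
            ctx = list(history[-n_ctx:])
            cb, fb = self._bucket(order, ctx), self._bucket(order, ctx + [token])
            self.ctx_counts[cb] = self.ctx_counts.get(cb, 0) + 1
            self.full_counts[fb] = self.full_counts.get(fb, 0) + 1

    def prob(self, history: Sequence[int], token: int) -> Optional[Tuple[float, int]]:
        """Back off from the longest usable order; return (probability, order)."""
        for order in range(self.max_order, self.min_order - 1, -1):
            n_ctx = order - 1
            if len(history) < n_ctx:
                continue
            ctx = list(history[-n_ctx:])
            seen = self.ctx_counts.get(self._bucket(order, ctx), 0)
            if seen > 0:
                hits = self.full_counts.get(self._bucket(order, ctx + [token]), 0)
                return min(1.0, hits / seen), order  # clamp in case of collisions
        return None


def mix(p_model: float, p_ngram: float, entropy_bits: float, max_entropy_bits: float) -> float:
    """Assumed rule: trust the n-gram more when the model's entropy is high."""
    weight = min(1.0, max(0.0, entropy_bits / max_entropy_bits))
    return (1.0 - weight) * p_model + weight * p_ngram
```

Hash collisions can inflate counts, which is the usual trade-off for a fixed memory budget.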

Two-pass n-gram evaluation that rescores early chunks with the complete cache
Cold-cache penalty reduction for early validation chunks
Backward-looking compliant rescoring of tokens already evaluated in pass 1
Combination of score-first TTT, GPTQ-Int5 export, and n-gram rescoring in a single pipeline
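
The two-pass idea in the list above can be sketched with a toy cache. Pass 1 scores chunks in order, adding each chunk to the cache only after it is scored, so early chunks see a nearly empty ("cold") cache. Pass 2 then rescores the first `rescore_chunks` chunks against the complete cache. The bigram cache with add-one smoothing is a hypothetical stand-in for the PR's n-gram model, and the sketch assumes the complete cache includes counts from every chunk scored in pass 1.

```python
import math
from typing import Dict, List, Sequence, Tuple


class BigramCache:
    """Hypothetical toy cache: add-one smoothed bigram counts."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.pair: Dict[Tuple[int, int], int] = {}
        self.ctx: Dict[int, int] = {}

    def update(self, tokens: Sequence[int]) -> None:
        for prev, cur in zip(tokens, tokens[1:]):
            self.pair[(prev, cur)] = self.pair.get((prev, cur), 0) + 1
            self.ctx[prev] = self.ctx.get(prev, 0) + 1

    def bits(self, tokens: Sequence[int]) -> float:
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.pair.get((prev, cur), 0) + 1) / (self.ctx.get(prev, 0) + self.vocab_size)
            total -= math.log2(p)
        return total


def two_pass_scores(
    chunks: List[Sequence[int]], vocab_size: int, rescore_chunks: int
) -> Tuple[List[float], List[float]]:
    """Return (pass-1 scores, final scores after rescoring the early chunks)."""
    cache = BigramCache(vocab_size)
    pass1 = []
    for chunk in chunks:
        pass1.append(cache.bits(chunk))  # score first
        cache.update(chunk)              # then add the chunk to the cache
    final = list(pass1)
    for i in range(min(rescore_chunks, len(chunks))):
        final[i] = cache.bits(chunks[i])  # pass 2: complete cache
    return pass1, final
```

On repetitive data the rescored early chunks cost fewer bits than in pass 1, which is the cold-cache penalty the list refers to. Chunks beyond `rescore_chunks` keep their pass-1 scores.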