val_bpb: 0.2071
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.5 MB
Training Techniques
Quantization
GPTQ
parameters: {"bits":5,"scope":"all"}
Compression
zstd
parameters: {"level":22}
Architecture Modifications
BigramHash
Added a hashed n-gram / bigram cache component with multi-order backoff and order-adaptive gating.
parameters: {"orders":"2-9","buckets":4000000,"dim":128}
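The cache implementation isn't described beyond its parameters; a minimal sketch consistent with those listed (orders 2-9, hashed buckets), with the counting scheme and the shared bucket space across orders as assumptions:

```python
from collections import defaultdict

class HashedNgramCache:
    """Hashed n-gram cache with multi-order backoff: contexts are hashed
    into a fixed bucket space, and prediction backs off from the highest
    order with observed counts down to the lowest."""

    def __init__(self, orders=range(2, 10), buckets=4_000_000):
        self.orders = sorted(orders, reverse=True)  # try high orders first
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))  # bucket -> token -> count

    def _bucket(self, context):
        # All orders share one bucket space here (a simplification).
        return hash(tuple(context)) % self.buckets

    def update(self, tokens):
        """Record next-token counts for every order along the sequence."""
        for i in range(1, len(tokens)):
            for order in self.orders:
                if i - (order - 1) >= 0:
                    ctx = tokens[i - (order - 1):i]
                    self.counts[self._bucket(ctx)][tokens[i]] += 1

    def predict(self, context):
        """Back off to the highest order with data; return token -> prob."""
        for order in self.orders:
            ctx = context[len(context) - (order - 1):]
            if len(ctx) < order - 1:
                continue  # not enough context for this order
            stats = self.counts.get(self._bucket(ctx))
            if stats:
                total = sum(stats.values())
                return {t: c / total for t, c in stats.items()}
        return {}
```

In the full system this distribution would be interpolated with the neural model's output under the order-adaptive gate.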
XSA
Applied XSA across all layers.
parameters: {"layers":11,"window_size":8}
MLP3x
Expanded the MLP with a LeakyReLU-based nonlinearity.
parameters: {"multiplier":3.5}
VE128
Added VE128 in upper layers.
parameters: {"layers":"9-10"}
Optimizer
AdamW
parameters: {"learning_rate":0.0001,"epochs":4,"weight_decay":null,"momentum":null}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
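Polyak averaging with decay 0.998 keeps an exponential moving average of the weights and evaluates with the averaged copy; a minimal sketch, with plain dicts standing in for parameter tensors:

```python
def ema_update(avg_params, params, decay=0.998):
    """One Polyak/EMA step: avg <- decay * avg + (1 - decay) * current,
    applied elementwise to every parameter."""
    return {k: decay * avg_params[k] + (1 - decay) * params[k]
            for k in avg_params}
```

With decay 0.998 the average has an effective horizon of roughly 1 / (1 - 0.998) = 500 recent steps.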
Evaluation
stride-based eval
parameters: {"stride":64}
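In stride-based evaluation, a context window slides forward by `stride` tokens and only the last `stride` tokens of each window are scored, so every token is scored exactly once with near-maximal left context. The window length isn't given in the report, so it is an argument in this sketch of which positions get scored:

```python
def stride_eval_positions(seq_len, window, stride):
    """Return (context_start, score_from, score_to) triples: each window
    covers [context_start, score_to) but only [score_from, score_to) is
    scored, so together the triples score every token exactly once."""
    scored = []
    for start in range(0, seq_len, stride):
        end = min(start + stride, seq_len)
        ctx_start = max(0, end - window)  # fill the window with left context
        scored.append((ctx_start, start, end))
    return scored
```

For example, with a hypothetical 2048-token window and stride 64, every scored token (past the first window) sees at least 2048 - 64 tokens of context.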
Test-Time Training
score-first TTT
parameters: {"epochs":4,"learning_rate":0.0001,"freeze_blocks":2,"chunk_tokens":131072}
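The report gives the TTT hyperparameters (4 epochs, lr 1e-4, 2 frozen blocks, 131072-token chunks) but not the loop itself; "score-first" presumably means each chunk is scored before the model adapts on it. A sketch of that control flow, with `score` and `adapt` left abstract:

```python
def score_first_ttt(model, chunks, score, adapt):
    """Score-first test-time training: each chunk is scored with the
    current weights *before* the model fine-tunes on it, so no token's
    score is inflated by having already been trained on."""
    total = 0.0
    for chunk in chunks:
        total += score(model, chunk)  # evaluate first
        model = adapt(model, chunk)   # then adapt on this chunk
    return total, model
```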
LR Schedule
adaptive cosine decay
parameters: {"adaptive_lr":true,"adaptive_lr_max":3}
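The adaptive mechanism isn't described beyond `adaptive_lr_max: 3`; one reading is a cosine-decayed base rate whose multiplier is capped at 3x. A sketch under that assumption, with the boost signal left as an external argument:

```python
import math

def cosine_lr(step, total_steps, base_lr, boost=1.0, adaptive_lr_max=3.0):
    """Cosine decay from base_lr to 0, scaled by an adaptive multiplier
    capped at adaptive_lr_max. How the boost is derived is not specified
    in the report; here it is supplied by the caller."""
    scale = min(boost, adaptive_lr_max)
    return scale * base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))
```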
Regularization
CROWN-Q penalty
parameters: {"lambda":0.01}
pruning
parameters: {"pct":0.05,"type":"magnitude"}
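Magnitude pruning at pct 0.05 zeroes the smallest-magnitude 5% of weights; a minimal sketch over a flat weight list:

```python
def magnitude_prune(weights, pct=0.05):
    """Global magnitude pruning: zero the fraction `pct` of weights with
    the smallest absolute value."""
    k = int(len(weights) * pct)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Zeroed weights compress well, which complements the zstd stage listed above.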
Other
Order-adaptive entropy gating for n-gram cache with per-order thresholds and alpha multipliers.
parameters: {"high_order":9,"low_order":2}
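Only the order range (2-9) is given; the per-order thresholds and alpha multipliers below are placeholders. A sketch of the gate, engaging the cache only when the base model's predictive entropy exceeds the matched order's threshold:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_alpha(matched_order, model_probs, thresholds, alphas):
    """Order-adaptive entropy gate: return the cache mixing weight for the
    matched n-gram order, or 0 when the base model is already confident
    (entropy below that order's threshold)."""
    if entropy(model_probs) < thresholds[matched_order]:
        return 0.0  # model is confident; don't let the cache interfere
    return alphas[matched_order]
```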
Full-chunk cache sharing across all GPU ranks to increase n-gram data per rank.
parameters: {"ranks":8}
Adaptive temperature sharpening applied per token to compensate for under-confidence after quantization.
parameters: {"temperature":0.85}
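Temperature sharpening with T = 0.85 raises each probability to the power 1/T and renormalizes, concentrating mass on the top tokens; the per-token adaptivity isn't specified, so this sketch uses a fixed temperature:

```python
def sharpen(probs, temperature=0.85):
    """Sharpen a distribution with T < 1: p_i -> p_i^(1/T), renormalized.
    Counteracts under-confidence introduced by quantization."""
    powered = [p ** (1.0 / temperature) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]
```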
Online logit calibration using momentum-EMA of empirical frequency versus predicted probability.
parameters: null
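No parameters are given for the calibration; a sketch of one plausible form, tracking momentum-EMAs of observed token frequency and predicted probability and adding their log-ratio as a per-token logit correction (the momentum value is an assumption):

```python
import math

class OnlineCalibrator:
    """Online logit calibration: tokens the model systematically
    under-predicts get a positive logit correction, and vice versa."""

    def __init__(self, vocab_size, momentum=0.99, eps=1e-6):
        self.momentum, self.eps = momentum, eps
        self.freq = [1.0 / vocab_size] * vocab_size  # EMA of observed frequency
        self.pred = [1.0 / vocab_size] * vocab_size  # EMA of predicted prob

    def observe(self, token, probs):
        """Update both EMAs after seeing `token` predicted with `probs`."""
        m = self.momentum
        for i, p in enumerate(probs):
            self.freq[i] = m * self.freq[i] + (1 - m) * (1.0 if i == token else 0.0)
            self.pred[i] = m * self.pred[i] + (1 - m) * p

    def correct(self, logits):
        """Add log(freq / pred) to each logit."""
        return [l + math.log((f + self.eps) / (p + self.eps))
                for l, f, p in zip(logits, self.freq, self.pred)]
```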
5-expert Hedge mixer combining neural, unigram, bigram, trigram, and entropy experts.
parameters: {"eta":0.1}
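The five experts (neural, unigram, bigram, trigram, entropy) aren't specified further; the Hedge (multiplicative-weights) update with eta = 0.1 is standard and can be sketched as:

```python
import math

def hedge_step(expert_probs, weights, target, eta=0.1):
    """One Hedge step: mix the experts' distributions under the current
    (normalized) weights, then multiplicatively downweight each expert by
    exp(-eta * log_loss) on the token that actually occurred."""
    z = sum(weights)
    vocab = len(expert_probs[0])
    mixed = [sum((w / z) * ep[i] for w, ep in zip(weights, expert_probs))
             for i in range(vocab)]
    new_weights = []
    for w, ep in zip(weights, expert_probs):
        loss = -math.log(max(ep[target], 1e-12))  # this expert's log-loss
        new_weights.append(w * math.exp(-eta * loss))
    return mixed, new_weights
```

Experts that keep assigning high probability to the observed tokens accumulate weight, so the mixture tracks the best expert over time.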
Novel Contributions
- Order-adaptive entropy gating for n-gram cache
- Multi-order n-gram backoff cache with orders 2-9
- Full-chunk cache sharing across 8 GPU ranks
- Score-first test-time training with Polyak EMA
- Adaptive temperature sharpening
- Online logit calibration
- 5-expert Hedge mixer
- CROWN-Q plus GPTQ int5 with pruning and zstd compression