PR #803

open

Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer

by pentxayc
val_bpb
0.4416
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,875,857 bytes

Training Techniques

Architecture
XSA
Uses XSA-4 attention variant in an 11-layer transformer.
parameters: {"variant":4,"layers":11}
VRL
Value Residual Learning applied to the transformer.
parameters: null
LeakyReLU(0.5)^2
Uses squared LeakyReLU activation with negative slope 0.5.
parameters: {"negative_slope":0.5}
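The listed activation can be read as an elementwise LeakyReLU followed by squaring. A minimal sketch, assuming that reading (the exact treatment of the negative branch after squaring is not specified in the entry):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU, one plausible reading of LeakyReLU(0.5)^2.

    First apply LeakyReLU with the given negative slope, then square
    the result elementwise.
    """
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise
    y = x if x >= 0.0 else negative_slope * x
    # square the activation (note: this maps the negative branch to
    # positive values; a signed variant would multiply by sign(x) instead)
    return y * y
```

With negative_slope 0.5, an input of -2.0 passes through LeakyReLU as -1.0 and squares to 1.0, while +2.0 squares to 4.0.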
Quantization
mixed int6/int8
bits: 6
scope: model weights
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.0005}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
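Polyak averaging with decay 0.998 maintains an exponential moving average of the weights alongside the trained weights. A minimal sketch over flat parameter lists (the real implementation would iterate over model tensors):

```python
def polyak_update(avg_params: list, params: list, decay: float = 0.998) -> list:
    """One EMA step of Polyak weight averaging.

    avg <- decay * avg + (1 - decay) * current
    The averaged copy, not the raw trained weights, is what gets evaluated.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```

Starting the average at 0.0 and stepping toward a weight of 1.0 moves it by only (1 - 0.998) = 0.002 per step, so the average tracks a slow, smoothed trajectory of the optimizer.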
Evaluation
sliding window eval
parameters: {"stride":64}
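Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows, advancing 64 tokens at a time so each scored token sees as much left context as the window allows. A minimal sketch; `score_fn` is a hypothetical stand-in for the model's summed negative log-likelihood of the target span given the context:

```python
def sliding_window_nll(tokens: list, context_len: int, stride: int, score_fn) -> float:
    """Total NLL over `tokens`, scored `stride` tokens at a time.

    score_fn(context, targets) -> summed NLL of `targets` given `context`
    (hypothetical signature standing in for the real model call).
    Each step scores only the new `stride` tokens, conditioned on up to
    `context_len` preceding tokens.
    """
    total = 0.0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        # left context for this window, truncated to the model's window size
        ctx = tokens[max(0, end - context_len):start]
        total += score_fn(ctx, tokens[start:end])
    return total
```

A smaller stride gives more context per scored token (and lower BPB) at the cost of proportionally more forward passes.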
Test-Time Training
score-first TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"epochs_per_chunk":4,"freeze_blocks":9,"polyak_ema":0.998}
Compression
lzma
level: null
Regularization
weight decay
parameters: null
Other
other
Complementary training that downweights tokens predictable by bigram statistics so the neural model specializes on harder tokens.
parameters: {"complement_alpha":0.5}
other
Backoff n-gram mixer with orders 2-10; a greedy cascade backs off from the highest order, using the first order whose context has been seen.
parameters: {"ngram_order_min":2,"ngram_order_max":10,"buckets":4194304}
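A minimal sketch of the backoff mixer under the listed parameters (orders 2-10, hashed contexts in 4,194,304 buckets). Counting and hashing details are assumptions; only the greedy highest-order cascade is stated in the entry:

```python
class BackoffNgramMixer:
    """Hashed n-gram counts with greedy highest-order backoff."""

    def __init__(self, order_min: int = 2, order_max: int = 10,
                 buckets: int = 4_194_304):
        self.order_min, self.order_max = order_min, order_max
        self.buckets = buckets
        # counts[order][bucket] -> {next_token: count}
        self.counts = {n: {} for n in range(order_min, order_max + 1)}

    def _bucket(self, context: tuple) -> int:
        # hash the context tuple into a fixed number of buckets
        return hash(context) % self.buckets

    def update(self, tokens: list) -> None:
        """Accumulate counts for every order over the token stream."""
        for n in range(self.order_min, self.order_max + 1):
            ctx_len = n - 1
            for i in range(ctx_len, len(tokens)):
                b = self._bucket(tuple(tokens[i - ctx_len:i]))
                slot = self.counts[n].setdefault(b, {})
                slot[tokens[i]] = slot.get(tokens[i], 0) + 1

    def predict(self, context: list):
        """Greedy cascade: return the distribution from the highest
        order whose context has been seen, or None if no order matches."""
        for n in range(self.order_max, self.order_min - 1, -1):
            ctx_len = n - 1
            if len(context) < ctx_len:
                continue
            slot = self.counts[n].get(self._bucket(tuple(context[-ctx_len:])))
            if slot:
                total = sum(slot.values())
                return {t: c / total for t, c in slot.items()}
        return None
```

Because buckets are shared via hashing, distinct contexts can collide; the fixed bucket count is what bounds the cache's memory footprint.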
other
Entropy-adaptive alpha blending between neural and n-gram probabilities.
parameters: {"alpha_base":0.2,"alpha_range":0.55,"alpha_center":3}
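The listed parameters (alpha_base 0.2, alpha_range 0.55, alpha_center 3) suggest a bounded ramp in the neural model's predictive entropy: the more uncertain the model, the more weight the n-gram distribution gets. The sigmoid form below is an assumption; the entry specifies only that the blend is entropy-adaptive:

```python
import math

def adaptive_alpha(entropy: float, alpha_base: float = 0.2,
                   alpha_range: float = 0.55, alpha_center: float = 3.0) -> float:
    """Mixing weight for the n-gram side as a function of model entropy.

    Rises smoothly from alpha_base (confident model) toward
    alpha_base + alpha_range (uncertain model), centered at
    alpha_center nats. The sigmoid shape is an assumed functional form.
    """
    return alpha_base + alpha_range / (1.0 + math.exp(-(entropy - alpha_center)))

def mix_prob(p_neural: float, p_ngram: float, alpha: float) -> float:
    """Per-token blend of the two probability estimates."""
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```

At the center entropy of 3 nats, alpha sits at the midpoint 0.2 + 0.55/2 = 0.475, so the two models contribute almost equally; at low entropy the blend is dominated by the neural model.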

Novel Contributions

  • Complementary training using bigram-based loss reweighting to specialize the neural model on tokens n-gram caches cannot predict.
  • Higher eval-time n-gram mixing weight enabled by deliberately weakening the model where n-grams are strong.
  • BackoffNgramMixer with orders 2-10 and greedy highest-order matching.
  • Entropy-adaptive alpha blending based on model uncertainty.
  • Combination of AdamW test-time training with Polyak EMA and frozen early blocks.