PR #803

open

Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer

by pentxayc
val_bpb
0.4416
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,875,857 bytes

Training Techniques

Architecture
XSA
Uses XSA-4 attention variant in an 11-layer transformer.
parameters: {"variant":4,"layers":11}
VRL
Value Residual Learning applied to the transformer.
parameters: null
LeakyReLU(0.5)^2
Uses squared LeakyReLU activation with negative slope 0.5.
parameters: {"negative_slope":0.5}
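The listed activation can be read as an elementwise LeakyReLU followed by squaring. A minimal sketch, assuming that reading (the exact treatment of the negative branch after squaring is not specified in the entry):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Squared LeakyReLU, one plausible reading of LeakyReLU(0.5)^2.

    First apply LeakyReLU with the given negative slope, then square
    the result elementwise.
    """
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise
    y = x if x >= 0.0 else negative_slope * x
    # square the activation (note: this maps the negative branch to
    # positive values; a signed variant would multiply by sign(x) instead)
    return y * y
```

With negative_slope 0.5, an input of -2.0 passes through LeakyReLU as -1.0 and squares to 1.0, while +2.0 squares to 4.0.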
Quantization
mixed int6/int8
bits: 6
scope: model weights
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.0005}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
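Polyak averaging with decay 0.998 maintains an exponential moving average of the weights alongside the trained weights. A minimal sketch over flat parameter lists (the real implementation would iterate over model tensors):

```python
def polyak_update(avg_params: list, params: list, decay: float = 0.998) -> list:
    """One EMA step of Polyak weight averaging.

    avg <- decay * avg + (1 - decay) * current
    The averaged copy, not the raw trained weights, is what gets evaluated.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```

Starting the average at 0.0 and stepping toward a weight of 1.0 moves it by only (1 - 0.998) = 0.002 per step, so the average tracks a slow, smoothed trajectory of the optimizer.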
Evaluation
sliding window eval
parameters: {"stride":64}
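Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows, advancing 64 tokens at a time so each scored token sees as much left context as the window allows. A minimal sketch; `score_fn` is a hypothetical stand-in for the model's summed negative log-likelihood of the target span given the context:

```python
def sliding_window_nll(tokens: list, context_len: int, stride: int, score_fn) -> float:
    """Total NLL over `tokens`, scored `stride` tokens at a time.

    score_fn(context, targets) -> summed NLL of `targets` given `context`
    (hypothetical signature standing in for the real model call).
    Each step scores only the new `stride` tokens, conditioned on up to
    `context_len` preceding tokens.
    """
    total = 0.0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        # left context for this window, truncated to the model's window size
        ctx = tokens[max(0, end - context_len):start]
        total += score_fn(ctx, tokens[start:end])
    return total
```

A smaller stride gives more context per scored token (and lower BPB) at the cost of proportionally more forward passes.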
Test-Time Training
score-first TTT
parameters: {"optimizer":"AdamW","learning_rate":0.0005,"epochs_per_chunk":4,"freeze_blocks":9,"polyak_ema":0.998}
Compression
lzma
level: null
Regularization
weight decay
parameters: null
Other
other
Complementary training that downweights tokens predictable by bigram statistics so the neural model specializes on harder tokens.
parameters: {"complement_alpha":0.5}
other
Backoff n-gram mixer with orders 2-10; a greedy cascade backs off from the highest order, using the first order whose context has been seen.
parameters: {"ngram_order_min":2,"ngram_order_max":10,"buckets":4194304}
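A minimal sketch of the backoff mixer under the listed parameters (orders 2-10, hashed contexts in 4,194,304 buckets). Counting and hashing details are assumptions; only the greedy highest-order cascade is stated in the entry:

```python
class BackoffNgramMixer:
    """Hashed n-gram counts with greedy highest-order backoff."""

    def __init__(self, order_min: int = 2, order_max: int = 10,
                 buckets: int = 4_194_304):
        self.order_min, self.order_max = order_min, order_max
        self.buckets = buckets
        # counts[order][bucket] -> {next_token: count}
        self.counts = {n: {} for n in range(order_min, order_max + 1)}

    def _bucket(self, context: tuple) -> int:
        # hash the context tuple into a fixed number of buckets
        return hash(context) % self.buckets

    def update(self, tokens: list) -> None:
        """Accumulate counts for every order over the token stream."""
        for n in range(self.order_min, self.order_max + 1):
            ctx_len = n - 1
            for i in range(ctx_len, len(tokens)):
                b = self._bucket(tuple(tokens[i - ctx_len:i]))
                slot = self.counts[n].setdefault(b, {})
                slot[tokens[i]] = slot.get(tokens[i], 0) + 1

    def predict(self, context: list):
        """Greedy cascade: return the distribution from the highest
        order whose context has been seen, or None if no order matches."""
        for n in range(self.order_max, self.order_min - 1, -1):
            ctx_len = n - 1
            if len(context) < ctx_len:
                continue
            slot = self.counts[n].get(self._bucket(tuple(context[-ctx_len:])))
            if slot:
                total = sum(slot.values())
                return {t: c / total for t, c in slot.items()}
        return None
```

Because buckets are shared via hashing, distinct contexts can collide; the fixed bucket count is what bounds the cache's memory footprint.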
other
Entropy-adaptive alpha blending between neural and n-gram probabilities.
parameters: {"alpha_base":0.2,"alpha_range":0.55,"alpha_center":3}
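The listed parameters (alpha_base 0.2, alpha_range 0.55, alpha_center 3) suggest a bounded ramp in the neural model's predictive entropy: the more uncertain the model, the more weight the n-gram distribution gets. The sigmoid form below is an assumption; the entry specifies only that the blend is entropy-adaptive:

```python
import math

def adaptive_alpha(entropy: float, alpha_base: float = 0.2,
                   alpha_range: float = 0.55, alpha_center: float = 3.0) -> float:
    """Mixing weight for the n-gram side as a function of model entropy.

    Rises smoothly from alpha_base (confident model) toward
    alpha_base + alpha_range (uncertain model), centered at
    alpha_center nats. The sigmoid shape is an assumed functional form.
    """
    return alpha_base + alpha_range / (1.0 + math.exp(-(entropy - alpha_center)))

def mix_prob(p_neural: float, p_ngram: float, alpha: float) -> float:
    """Per-token blend of the two probability estimates."""
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```

At the center entropy of 3 nats, alpha sits at the midpoint 0.2 + 0.55/2 = 0.475, so the two models contribute almost equally; at low entropy the blend is dominated by the neural model.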

Novel Contributions

  • Complementary training using bigram-based loss reweighting to specialize the neural model on tokens n-gram caches cannot predict.
  • Higher eval-time n-gram mixing weight enabled by deliberately weakening the model where n-grams are strong.
  • BackoffNgramMixer with orders 2-10 and greedy highest-order matching.
  • Entropy-adaptive alpha blending based on model uncertainty.
  • Combination of AdamW test-time training with Polyak EMA and frozen early blocks.