PR #813

open

Record: BackoffNgramMixer (mean val_bpb=0.6671)

by hypery11
val_bpb: 0.6671
Architecture: Transformer
Optimizer:
Artifact Size: ~16.0 MB

Training Techniques

Architecture
XSA
Uses XSA-all attention variant in an 11-layer transformer.
parameters: {"layers":11,"dim":512,"heads":"8/8 full MHA"}
LeakyReLU MLP
Uses a squared LeakyReLU activation, LeakyReLU(x; 0.5)^2, with a widened MLP.
parameters: {"mlp_multiplier":3.5}
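The activation above can be sketched in NumPy (assuming it means LeakyReLU with negative slope 0.5 followed by an elementwise square; the record does not spell out the exact formulation):

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU(0.5) then square: x^2 for x >= 0, (0.5 * x)^2 otherwise."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Note that squaring makes the output nonnegative on both branches, as with the squared-ReLU activations used in some recent speedrun-style transformers.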
BigramHash
Adds a BigramHash component.
parameters: null
SmearGate
Adds a SmearGate component.
parameters: null
Value Residual
Uses value residual connections.
parameters: null
Gated Attention
Uses gated attention.
parameters: null
BackoffNgramMixer
GPU-vectorized multi-order n-gram backoff mixer with entropy-adaptive alpha mixing and a score-first, backward-looking cache.
parameters: {"orders":"2-7"}
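The mixer entry above can be sketched in plain Python, using hash-table counts rather than the GPU-vectorized form; the class name is from the record, but the alpha schedule, `alpha_max`, and all method names are assumptions, not the author's implementation:

```python
import math
from collections import defaultdict

class BackoffNgramMixer:
    """Sketch of a multi-order n-gram backoff mixer.

    Counts n-grams of orders 2..7; at prediction time it backs off from the
    longest matching context to shorter ones, then mixes the n-gram
    distribution with the model's distribution using an alpha that shrinks
    as the n-gram distribution's entropy grows (entropy-adaptive mixing).
    """

    def __init__(self, orders=range(2, 8), vocab_size=256, alpha_max=0.5):
        self.orders = sorted(orders, reverse=True)  # try longest context first
        self.vocab_size = vocab_size
        self.alpha_max = alpha_max  # assumed cap on the n-gram weight
        # counts[n][context_tuple][next_token] = count
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    def ngram_dist(self, context):
        """Back off from the longest matching context to shorter ones."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            table = self.counts[n].get(ctx)
            if table:
                total = sum(table.values())
                return {t: c / total for t, c in table.items()}
        return None

    def mix(self, context, model_probs):
        """Mix model probs with n-gram probs; alpha decays with entropy."""
        dist = self.ngram_dist(context)
        if dist is None:
            return model_probs  # no n-gram support: fall back to the model
        entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
        alpha = self.alpha_max * (1.0 - entropy / math.log(self.vocab_size))
        return [(1 - alpha) * p + alpha * dist.get(t, 0.0)
                for t, p in enumerate(model_probs)]
```

A confident (low-entropy) n-gram match pulls the mixture toward the cache; a flat n-gram distribution leaves the model's probabilities nearly untouched.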
Quantization
int5
bits: 5
scope: all
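The record says only "int5, scope: all"; a minimal sketch of one plausible scheme, symmetric per-tensor 5-bit quantization, looks like this (the rounding mode and per-tensor scaling are assumptions):

```python
import numpy as np

def quantize_int5(w):
    """Symmetric 5-bit quantization: signed levels in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1  # 15
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reconstructs exactly
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale):
    return q.astype(np.float32) * scale
```

Since each value occupies only 5 of the 8 stored bits, packing (or the zstd pass listed below) would be needed to realize the full size reduction in the ~16 MB artifact.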
Weight Averaging
EMA
parameters: null
SWA
parameters: {"type":"Tight SWA"}
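The two averaging entries above can be sketched over plain parameter dicts; the decay value and the interpretation of "Tight SWA" as a narrow checkpoint window near the end of training are assumptions:

```python
def ema_update(avg, new, decay=0.999):
    """One EMA step over a parameter dict (decay value is an assumption)."""
    return {k: decay * avg[k] + (1.0 - decay) * new[k] for k in avg}

def swa_average(checkpoints):
    """SWA: uniform average over a window of checkpoints; a "tight" SWA
    would presumably draw this window from late in training only."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```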
Compression
zstd
level: null
Test-Time Training
score-first TTT
parameters: {"backward_looking":true,"entropy_adaptive_alpha":true}
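The "score-first" and "backward_looking" flags above suggest an evaluation sweep of the following shape (a sketch of the idea only; `score` and `update` are hypothetical callbacks, not the author's API):

```python
def score_first_pass(tokens, score, update):
    """Score-first, backward-looking sweep: each token is scored against a
    cache built only from earlier tokens, and only afterwards folded into
    that cache, so the cache never sees the token it is scoring."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(score(tokens[:i], tok))  # score first, from the past only
        update(tok)                         # then extend the cache
    return out
```

Ordering the score before the update is what keeps the test-time cache leakage-free when it is used to compute val_bpb.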

Novel Contributions

  • BackoffNgramMixer with entropy-adaptive alpha mixing
  • GPU-vectorized multi-order n-gram backoff over orders 2-7
  • Score-first, backward-looking cache for inference
  • 11-layer transformer with XSA-all and widened MLP
  • int5 quantization with zstd compression
  • EMA and Tight SWA