val_bpb: 0.6671
Architecture: Transformer
Optimizer: —
Artifact Size: ~16.0 MB
Training Techniques
Architecture
XSA
Uses the XSA-all attention variant in an 11-layer transformer.
parameters: {"layers":11,"dim":512,"heads":"8/8 full MHA"}
LeakyReLU MLP
Uses a squared LeakyReLU activation (negative slope 0.5) with a widened MLP.
parameters: {"mlp_multiplier":3.5}
BigramHash
Adds a BigramHash component.
parameters: null
SmearGate
Adds a SmearGate component.
parameters: null
Value Residual
Uses value residual connections.
parameters: null
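A minimal sketch of value residual connections as commonly implemented: each layer's value vectors are blended with the first layer's. The learnable mixing weight and its initialization are assumptions.

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """Mix each attention layer's value vectors with the first layer's values."""
    def __init__(self, init_lambda: float = 0.5):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(init_lambda))  # learnable, per layer

    def forward(self, v: torch.Tensor, v_first: torch.Tensor) -> torch.Tensor:
        # v, v_first: (batch, heads, seq, head_dim); layer 1 passes its own v as v_first.
        return self.lam * v + (1.0 - self.lam) * v_first
```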
Gated Attention
Uses gated attention.
parameters: null
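Gated attention is often realized as an elementwise sigmoid gate on the attention output, computed from the block input; the sketch below assumes that form, since the card gives no parameters.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Sigmoid gate on the attention output, computed from the block input."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: block input, attn_out: self-attention result (both (batch, seq, dim)).
        return torch.sigmoid(self.gate(x)) * attn_out
```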
BackoffNgramMixer
GPU-vectorized multi-order n-gram backoff mixer with entropy-adaptive alpha mixing and a score-first, backward-looking cache.
parameters: {"orders":"2-7"}
Quantization
int5
bits: 5
scope: all
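A sketch of one plausible int5 scheme: symmetric per-tensor quantization onto integer levels in [-15, 15], stored alongside a floating-point scale. Per-channel scaling and bit-packing details are not given in the card, so this is an assumption.

```python
import torch

def quantize_int5(w: torch.Tensor):
    """Symmetric per-tensor 5-bit quantization: map weights to integers in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1                        # 15
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale                                # int8 storage of 5-bit values + fp scale

def dequantize_int5(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```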
Weight Averaging
EMA
parameters: null
SWA
parameters: {"type":"Tight SWA"}
Compression
zstd
level: null
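A sketch of the compression step, assuming the quantized state dict is serialized and then compressed with zstd via the zstandard package; the level shown is illustrative since the card leaves it unspecified.

```python
import io
import torch
import zstandard as zstd

def compress_state_dict(state_dict, level: int = 19) -> bytes:
    """Serialize a (quantized) state dict and compress it with zstd."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    return zstd.ZstdCompressor(level=level).compress(buf.getvalue())

def decompress_state_dict(blob: bytes):
    raw = zstd.ZstdDecompressor().decompress(blob)
    return torch.load(io.BytesIO(raw))
```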
Test-Time Training
score-first TTT
parameters: {"backward_looking":true,"entropy_adaptive_alpha":true}
Novel Contributions
- BackoffNgramMixer with entropy-adaptive alpha mixing
- GPU-vectorized multi-order n-gram backoff over orders 2-7
- Score-first, backward-looking cache for inference
- 11-layer transformer with XSA-all attention and a widened MLP
- int5 quantization with zstd compression
- EMA and Tight SWA weight averaging