val_bpb: 0.6671
Architecture: Transformer
Optimizer: —
Artifact Size: ~16.0 MB
Training Techniques
Architecture
XSA
Uses the XSA-all attention variant in an 11-layer transformer.
parameters: {"layers":11,"dim":512,"heads":"8/8 full MHA"}
LeakyReLU MLP
Uses a squared LeakyReLU activation (negative slope 0.5) with a widened MLP.
parameters: {"mlp_multiplier":3.5}
BigramHash
Adds a BigramHash component.
parameters: null
SmearGate
Adds a SmearGate component.
parameters: null
Value Residual
Uses value residual connections.
parameters: null
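A minimal sketch of value residual connections as commonly implemented: each layer's value vectors are blended with the first layer's. The learnable mixing weight and its initialization are assumptions.

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """Mix each attention layer's value vectors with the first layer's values."""
    def __init__(self, init_lambda: float = 0.5):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(init_lambda))  # learnable, per layer

    def forward(self, v: torch.Tensor, v_first: torch.Tensor) -> torch.Tensor:
        # v, v_first: (batch, heads, seq, head_dim); layer 1 passes its own v as v_first.
        return self.lam * v + (1.0 - self.lam) * v_first
```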
Gated Attention
Uses gated attention.
parameters: null
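Gated attention is often realized as an elementwise sigmoid gate on the attention output, computed from the block input; the sketch below assumes that form, since the card gives no parameters.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Sigmoid gate on the attention output, computed from the block input."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: block input, attn_out: self-attention result (both (batch, seq, dim)).
        return torch.sigmoid(self.gate(x)) * attn_out
```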
BackoffNgramMixer
GPU-vectorized multi-order n-gram backoff mixer with entropy-adaptive alpha mixing and a score-first, backward-looking cache.
parameters: {"orders":"2-7"}
Quantization
int5
bits: 5
scope: all
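A sketch of one plausible int5 scheme: symmetric per-tensor quantization onto integer levels in [-15, 15], stored alongside a floating-point scale. Per-channel scaling and bit-packing details are not given in the card, so this is an assumption.

```python
import torch

def quantize_int5(w: torch.Tensor):
    """Symmetric per-tensor 5-bit quantization: map weights to integers in [-15, 15]."""
    qmax = 2 ** (5 - 1) - 1                        # 15
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale                                # int8 storage of 5-bit values + fp scale

def dequantize_int5(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```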
Weight Averaging
EMA
parameters: null
SWA
parameters: {"type":"Tight SWA"}
Compression
zstd
level: null
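A sketch of the compression step, assuming the quantized state dict is serialized and then compressed with zstd via the zstandard package; the level shown is illustrative since the card leaves it unspecified.

```python
import io
import torch
import zstandard as zstd

def compress_state_dict(state_dict, level: int = 19) -> bytes:
    """Serialize a (quantized) state dict and compress it with zstd."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    return zstd.ZstdCompressor(level=level).compress(buf.getvalue())

def decompress_state_dict(blob: bytes):
    raw = zstd.ZstdDecompressor().decompress(blob)
    return torch.load(io.BytesIO(raw))
```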
Test-Time Training
score-first TTT
parameters: {"backward_looking":true,"entropy_adaptive_alpha":true}
Novel Contributions
- BackoffNgramMixer with entropy-adaptive alpha mixing
- GPU-vectorized multi-order n-gram backoff over orders 2-7
- Score-first, backward-looking cache for inference
- 11-layer transformer with XSA-all attention and a widened MLP
- int5 quantization with zstd compression
- EMA and Tight SWA weight averaging