val_bpb: 0.5440
Architecture: 11-layer Transformer
Optimizer: —
Artifact Size: 16.0 MB
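val_bpb is validation bits-per-byte. As a reminder of how the metric is defined (a standard conversion, not code from this submission):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    # Convert a summed negative log-likelihood (in nats) to bits,
    # then normalize by the number of raw input bytes.
    return total_nll_nats / (num_bytes * math.log(2))
```

Lower is better; 0.5440 bpb means the model needs about 0.544 bits on average to encode each byte of the validation set.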
Training Techniques
Architecture
- XSA-all: uses the XSA-all attention mechanism in the transformer (parameters: null)
- MLP3.5x: expands the MLP hidden width by a factor of 3.5 (parameters: {"mlp_multiplier": 3.5})
- LeakyReLU: uses a LeakyReLU(0.5)^2 activation, i.e. LeakyReLU with negative slope 0.5, then squared (parameters: {"negative_slope": 0.5, "power": 2})
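The LeakyReLU(0.5)^2 activation admits a direct reading from its parameters: LeakyReLU with negative slope 0.5, then raised to the power 2. A minimal sketch under that assumption:

```python
def squared_leaky_relu(x: float, negative_slope: float = 0.5, power: int = 2) -> float:
    # LeakyReLU: identity for x >= 0, negative_slope * x otherwise;
    # then raise the result to the given power (2 here).
    y = x if x >= 0 else negative_slope * x
    return y ** power
```

With power 2 the output is nonnegative and smooth at zero, similar in spirit to the squared-ReLU activations used in some recent transformer recipes.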
Quantization
- int5: 5-bit integer quantization (bits: 5, scope: all)
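The exact int5 scheme is not specified; a common symmetric per-tensor variant looks like the following (the scale choice and rounding rule are assumptions, not details from this submission):

```python
def quantize_int5(weights: list[float]) -> tuple[list[int], float]:
    # Symmetric 5-bit quantization: integers in [-15, 15], one scale per tensor.
    qmax = 15
    max_abs = max(abs(w) for w in weights)
    scale = (max_abs / qmax) or 1.0  # avoid division by zero for an all-zero tensor
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    # Reconstruct approximate float weights from the integer codes.
    return [v * scale for v in q]
```

Storing 5-bit codes plus a scale per tensor is what lets the whole artifact fit in 16.0 MB after compression.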
Weight Averaging
- EMA: exponential moving average of model weights (parameters: null)
- SWA: stochastic weight averaging (parameters: {"type": "Tight SWA"})
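EMA maintains a decayed running average of the weights alongside training and evaluates with the averaged copy. A minimal sketch (the decay value is an assumed hyperparameter, not reported here):

```python
class WeightEMA:
    def __init__(self, weights: list[float], decay: float = 0.999):
        self.decay = decay
        self.shadow = list(weights)  # averaged copy, used at evaluation time

    def update(self, weights: list[float]) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        self.shadow = [self.decay * s + (1.0 - self.decay) * w
                       for s, w in zip(self.shadow, weights)]
```

SWA differs in averaging checkpoints uniformly over a window rather than exponentially; "Tight SWA" presumably narrows that window, but its exact definition is not given here.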
Evaluation
- Order-adaptive entropy-gated BackoffNgramMixer (parameters: {"orders": "2-7 gram", "per_order_entropy_thresholds": true, "score_first": true, "backward_looking": true, "deterministic": true})
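The mixer itself is the novel contribution and its internals are not given; the parameters suggest roughly this shape: try the highest-order n-gram model first and back off to lower orders whenever the predicted distribution's entropy exceeds that order's threshold. A hypothetical sketch (the names and the exact backoff rule are assumptions):

```python
import math

def entropy(dist: dict) -> float:
    # Shannon entropy in bits of a {token: probability} distribution.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def backoff_predict(context, models, thresholds):
    # models: {order: {context_tuple: {token: prob}}}, e.g. orders 2..7.
    # Back off from the highest order; accept the first distribution whose
    # entropy is at or below that order's threshold (per-order gating).
    for order in sorted(models, reverse=True):
        ctx = tuple(context[-(order - 1):]) if order > 1 else ()
        dist = models[order].get(ctx)
        if dist is not None and entropy(dist) <= thresholds[order]:
            return dist, order
    return None, 0  # no order was confident enough for this context
```

Because the gate is a deterministic threshold on entropy, the whole procedure is reproducible, matching the "deterministic": true parameter.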
Test-Time Training
- Score-first TTT (parameters: {"backward_looking": true})
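"Score-first" together with "backward_looking" suggests an evaluation loop in which each token is scored before the model adapts on it, so no token ever influences its own score. A sketch under that reading (the interface is assumed, not taken from this submission):

```python
def score_first_ttt(model, tokens, adapt):
    # For each position: score with the current model first, then let the
    # model train on the now-observed prefix (backward-looking only).
    total = 0.0
    for i, tok in enumerate(tokens):
        total += model.score(tokens[:i], tok)  # score before any update on tok
        adapt(model, tokens[:i + 1])           # adapt only on already-scored data
    return total
```

This ordering keeps the evaluation causally valid: the update at step i can affect scores at steps i+1 onward, but never the score already assigned to token i.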
Compression
- custom (level: null)
Novel Contributions
- Order-adaptive entropy-gated BackoffNgramMixer
- Per-order entropy thresholds for mixing weight selection
- Score-first, backward-looking, deterministic evaluation strategy
- 11-layer transformer with XSA-all and full MHA
- int5 quantization with compression
- EMA and Tight SWA training recipe