val_bpb: 0.8881
Architecture: Transformer
Optimizer: —
Artifact Size: 13.99 MB
Training Techniques
Architecture
XSA-all
11-layer transformer variant using XSA-all attention.
parameters: {"layers":11,"dim":512,"gqa_heads":"8/4"}
BigramHash
Hashed bigram embedding module (10,240-entry table, 128-dimensional vectors).
parameters: {"dimensions":128,"size":10240}
SmearGate
Gating component included in the model architecture.
parameters: null
MLP3x
Feed-forward MLP block with a 3x expansion factor.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
Value Residual
Adds value residual connections.
parameters: null
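Of the components above, BigramHash is the most self-contained to sketch. The card gives only its table size (10240) and dimension (128), so the hashing scheme below, which mixes adjacent token-id pairs into a fixed bucket, is an assumption, not the actual implementation.

```python
import numpy as np

class BigramHash:
    # Hypothetical sketch: the card specifies only
    # {"dimensions": 128, "size": 10240}; the multiplicative hash
    # of adjacent token-id pairs is an assumption.
    def __init__(self, size: int = 10240, dim: int = 128, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.table = rng.standard_normal((size, dim)).astype(np.float32) * 0.02

    def bucket(self, prev_id: int, cur_id: int) -> int:
        # Mix the two token ids with a simple multiplicative hash.
        return ((prev_id * 1000003) ^ cur_id) % self.size

    def __call__(self, token_ids: list[int]) -> np.ndarray:
        # Position 0 has no previous token; pair it with itself.
        out = np.empty((len(token_ids), self.table.shape[1]), dtype=np.float32)
        prev = token_ids[0]
        for i, tok in enumerate(token_ids):
            out[i] = self.table[self.bucket(prev, tok)]
            prev = tok
        return out

emb = BigramHash()([3, 14, 15, 9])  # one 128-dim feature per position
```

These features would presumably be added to or concatenated with the regular token embeddings before the transformer stack.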
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
zstd
level: 22
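The two steps above form a storage pipeline: quantize weights to 6-bit integers, then compress the serialized result. "GPTQ-lite" is not a published spec this sketch can reproduce, so plain symmetric round-to-nearest quantization stands in for it (real GPTQ applies error-compensating updates), and stdlib zlib stands in for the zstd level-22 codec the card actually uses.

```python
import zlib
import numpy as np

def quantize_int6(w: np.ndarray):
    # 6-bit signed range is [-32, 31]; use a single per-tensor scale.
    # Round-to-nearest here is a simplification of GPTQ-style methods.
    scale = float(np.abs(w).max()) / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def pack_and_compress(q: np.ndarray) -> bytes:
    # A real int6 format would bit-pack 4 values into 3 bytes; for
    # simplicity we compress the int8 buffer directly. zlib is a
    # stand-in for zstd level 22.
    return zlib.compress(q.tobytes(), level=9)

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, s = quantize_int6(w)
blob = pack_and_compress(q)
err = float(np.abs(dequantize(q, s) - w).max())  # bounded by scale / 2
```

The round-trip error of round-to-nearest quantization is at most half the scale, which is what makes a per-tensor 6-bit format viable for a small artifact like this one.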
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"Tight SWA"}
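The EMA entry fully specifies its update rule given the listed decay of 0.997; a minimal sketch follows. "Tight SWA" is not sketched, since the card does not define its window; a plain SWA would keep a uniform average over late checkpoints instead of an exponential one.

```python
# Exponential moving average of weights with the card's decay of 0.997.
# Each parameter tracks: ema <- decay * ema + (1 - decay) * weight.
def ema_update(ema, weights, decay=0.997):
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0]          # toy single-parameter model, EMA initialized at 0
for step in range(3):
    weights = [1.0]  # pretend training holds the weight at 1.0
    ema = ema_update(ema, weights)
# After k steps at a constant weight of 1.0, ema = 1 - decay**k.
```

With decay 0.997 the effective averaging window is roughly 1/(1-0.997) ≈ 333 steps.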
Evaluation
order-adaptive n-gram backoff cache
parameters: {"orders":"2-11","highest_order_first":true,"entropy_gating":true,"score_first":true,"deterministic":true}
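The parameters above describe the cache's behavior (orders 2-11, highest order tried first, entropy gating, deterministic) but not its implementation. The sketch below is one plausible reading: count n-grams per order, and at prediction time accept the highest-order context whose next-token distribution is confident enough, where "confident enough" means entropy below a per-order threshold. The threshold schedule is an assumption, chosen so that higher orders get lower thresholds, per the contributions list.

```python
import math
from collections import defaultdict

class NGramBackoffCache:
    # Hedged sketch of an order-adaptive, entropy-gated n-gram
    # backoff cache; the threshold schedule 2/n is an assumption.
    def __init__(self, orders=range(2, 12)):
        self.orders = sorted(orders, reverse=True)  # highest order first
        # counts[n][context_tuple][next_token] -> occurrence count
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}
        # Assumed schedule: threshold shrinks as order grows.
        self.thresholds = {n: 2.0 / n for n in self.orders}

    def update(self, tokens):
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[n][ctx][tokens[i + n - 1]] += 1

    @staticmethod
    def _entropy(dist):
        total = sum(dist.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in dist.values())

    def predict(self, context):
        # Deterministic: argmax over counts, no sampling.
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            if len(ctx) != n - 1:
                continue  # context too short for this order
            dist = self.counts[n].get(ctx)
            if dist and self._entropy(dist) <= self.thresholds[n]:
                return max(dist, key=dist.get)
        return None  # all orders rejected: fall back to the base model

cache = NGramBackoffCache()
cache.update([1, 2, 3, 1, 2, 3, 1, 2, 3])
```

A `None` return signals full backoff, at which point the base transformer's prediction would be used unchanged.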
Test-Time Training
score-first TTT
parameters: {"enabled":false}
Regularization
layerwise LN scaling
parameters: null
Novel Contributions
- 11-layer XSA-all transformer
- Order-adaptive, entropy-gated n-gram backoff over orders 2-11, with lower entropy thresholds for higher-order matches
- GPTQ-lite int6 quantization combined with zstd-22 compression
- Score-first, deterministic evaluation without TTT