val_bpb: 1.0465
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.99 MB
Training Techniques

Architecture

- XSA: Exclusive Self-Attention applied on all 11 layers (parameters: {"layers": 11})
- LeakyReLU(0.5)^2 MLP: MLP uses LeakyReLU with slope 0.5, squared, with 3x expansion (parameters: {"expansion": 3})
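The activation name "LeakyReLU(0.5)^2" most naturally reads as a leaky variant of the squared-ReLU activation: apply LeakyReLU with negative slope 0.5, then square. A minimal pure-Python sketch under that assumed reading (the weight layout and helper names are illustrative, not from the source):

```python
def leaky_relu_05_squared(x: float) -> float:
    """Assumed reading of "LeakyReLU(0.5)^2":
    y = LeakyReLU_0.5(x) ** 2, i.e. a leaky analogue of squared ReLU."""
    y = x if x >= 0.0 else 0.5 * x
    return y * y

def mlp_block(x, w_in, w_out):
    """Minimal MLP with 3x expansion (illustrative, pure-Python lists):
    hidden = act(x @ w_in), out = hidden @ w_out.
    Shapes: x[d], w_in[d][3d], w_out[3d][d]."""
    d = len(x)
    hidden = [leaky_relu_05_squared(sum(x[i] * w_in[i][j] for i in range(d)))
              for j in range(3 * d)]
    return [sum(hidden[j] * w_out[j][k] for j in range(3 * d))
            for k in range(d)]
```

Note that squaring makes the activation non-negative everywhere; the leaky slope only changes how steeply negative inputs contribute (a factor of 0.25 on the squared value).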
- BigramHash: bigram hash feature module (parameters: {"dimensions": 10240})
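A bigram hash feature module typically hashes each consecutive token pair into a fixed number of buckets (here 10240), giving the model a cheap sparse signal about local token pairs. A sketch under that assumption; the mixing constant is illustrative, only the bucket count comes from the card:

```python
def bigram_hash_features(tokens, dimensions=10240):
    """Hash each consecutive token pair into one of `dimensions` buckets,
    returning a sparse bucket -> count map. The multiplier 1000003 is an
    illustrative hash constant, not from the source."""
    feats = {}
    for a, b in zip(tokens, tokens[1:]):
        h = (a * 1000003 + b) % dimensions
        feats[h] = feats.get(h, 0) + 1
    return feats
```

In a real model these bucket indices would look up rows of a learned feature table; here only the hashing is shown.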
- SmearGate: gating mechanism used in the architecture
- Value Residual: adds value residual connections
- Gated Attention: attention mechanism with gating
- U-Net skip: U-Net-style skip connections
- Tied embeddings: input and output embeddings are tied
Regularization

- LN scaling: LayerNorm outputs scaled per layer by 1/sqrt(layer+1) (parameters: {"scale": "1/sqrt(layer+1)"})
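The LN scaling parameter fixes a per-layer factor of 1/sqrt(layer+1); where exactly it is applied (e.g. to each block's LayerNorm output before the residual add) is an assumption. The schedule itself is trivial to compute:

```python
import math

def ln_output_scales(num_layers=11):
    """Per-layer scale 1/sqrt(layer+1) for layers 0..num_layers-1,
    as given by the "LN scaling" parameter. Applying it to LayerNorm
    outputs (rather than, say, residual branches) is an assumption."""
    return [1.0 / math.sqrt(layer + 1) for layer in range(num_layers)]
```

This damps deeper layers' contributions roughly like a 1/sqrt(depth) schedule: layer 0 is unscaled, layer 3 is halved, and layer 10 is scaled by about 0.30.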
Quantization

- GPTQ-lite (bits: 6, scope: all)

Compression

- zstd (level: 22)
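The artifact pipeline is 6-bit quantization of all weights followed by zstd level-22 compression. A minimal sketch of the shape of that pipeline: symmetric per-tensor int6 quantization with plain round-to-nearest standing in for GPTQ-lite's error-compensated rounding, and stdlib zlib standing in for zstd (which is not in the standard library); byte-per-code packing stands in for real bit-packing:

```python
import struct
import zlib

def quantize_int6(weights):
    """Symmetric per-tensor 6-bit quantization sketch: codes in [-31, 31].
    GPTQ-style methods compensate rounding error weight-by-weight; this
    stand-in just rounds to nearest."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale

def pack_and_compress(codes, level=9):
    """Pack one int6 code per byte (a real artifact would bit-pack four
    codes into three bytes) and compress. zlib is a stdlib stand-in for
    zstd level 22."""
    raw = struct.pack(f"{len(codes)}b", *codes)
    return zlib.compress(raw, level)
```

Dequantization is just `code * scale`; the 13.99 MB artifact size would reflect the bit-packed, zstd-compressed codes plus scales.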
Weight Averaging

- EMA (parameters: {"decay": 0.997})
- SWA (parameters: {"type": "Tight SWA"})
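Both averaging schemes are standard; a sketch of each update with the card's EMA decay of 0.997. "Tight SWA" presumably averages over a narrow window of late checkpoints, but the window size is not given, so plain uniform SWA is shown:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step with the card's decay:
    ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

def swa_average(checkpoints):
    """Plain SWA: uniform average over a list of checkpoints (each a flat
    parameter list). "Tight SWA" is assumed to mean a narrow late window;
    the exact window is not specified in the card."""
    n = len(checkpoints)
    return [sum(c[i] for c in checkpoints) / n
            for i in range(len(checkpoints[0]))]
```

With decay 0.997 the EMA has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps.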
Evaluation

- Backward-looking eval cache (parameters: {"order": 7, "alpha": 0.4, "buckets": 4000000, "min_count": 2, "deterministic": true, "score_first": true})
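The parameters suggest a 7-gram cache built from already-seen evaluation text: hash the preceding 6 tokens into one of 4,000,000 buckets, and once a bucket has at least 2 observations, blend the cached next-token distribution into the model's prediction with weight alpha = 0.4. "Score-first" would mean each token is scored before being inserted, so it never contributes to its own score and the pass stays deterministic. A sketch under those assumptions (the hash constant and function names are illustrative):

```python
from collections import defaultdict

ORDER, ALPHA, BUCKETS, MIN_COUNT = 7, 0.4, 4_000_000, 2

def bucket(context):
    """Hash the length-(ORDER-1) context into a fixed bucket table.
    The hash itself is illustrative; only the bucket count is from the card."""
    h = 0
    for t in context:
        h = (h * 1000003 + t) % BUCKETS
    return h

def cached_prob(cache, totals, context, token, model_prob):
    """Blend model probability with the backward-looking cache:
    p = (1 - ALPHA) * p_model + ALPHA * p_cache, applied only once the
    context bucket has been seen at least MIN_COUNT times."""
    b = bucket(context[-(ORDER - 1):])
    if totals[b] >= MIN_COUNT:
        p_cache = cache[(b, token)] / totals[b]
        return (1 - ALPHA) * model_prob + ALPHA * p_cache
    return model_prob

def observe(cache, totals, context, token):
    """Score-first discipline: call cached_prob for a position BEFORE
    calling observe, so a token never contributes to its own score."""
    b = bucket(context[-(ORDER - 1):])
    cache[(b, token)] += 1
    totals[b] += 1
```

Because the cache only ever looks backward over text already scored, the blended score remains a legitimate (deterministic) evaluation rather than label leakage.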
Test-Time Training

- Score-first TTT, disabled (parameters: {"deterministic": true, "enabled": false})
Optimizer

- Muon (lr: 0.025, momentum: 0.99, weight_decay: 0.04)
- AdamW (hyperparameters not reported)
Novel Contributions
- 11-layer Transformer with XSA applied to all layers
- 7-gram backward-looking evaluation cache with fixed alpha and hash buckets
- GPTQ-lite int6 quantization combined with zstd-22 compression
- EMA, Tight SWA, and Late QAT training pipeline
- Use of BigramHash and SmearGate architectural components
- Score-first deterministic evaluation without TTT