val_bpb: 1.0340
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,903,061 bytes
Training Techniques
Architecture
LeakyReLU²
Uses LeakyReLU(0.5) squared in the MLP instead of ReLU², so that negative pre-activations still carry gradient.
parameters: {"negative_slope":0.5}
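A hedged sketch of this activation (the function name is illustrative; whether the submission preserves the sign of the negative branch after squaring is not stated, so plain squaring is assumed here):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU followed by squaring: for x >= 0 this matches the ReLU^2
    # activation (x^2); for x < 0 it returns (negative_slope * x)^2, so a
    # nonzero gradient flows through negative pre-activations.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```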
XSA
Cross-sequence attention applied to all transformer layers instead of only the last few layers.
parameters: {"layers":11}
Quantization
GPTQ
bits: 6
scope: all
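GPTQ itself quantizes weight columns using second-order (Hessian) information; as a point of reference, here is the simpler round-to-nearest 6-bit symmetric baseline that GPTQ improves upon (a minimal sketch, not the submission's quantizer):

```python
def quantize_rtn(weights, bits=6):
    # Symmetric round-to-nearest quantization: map each weight to one of
    # 2**bits signed integer levels using a single per-tensor scale.
    # GPTQ improves on this baseline by quantizing columns in activation
    # order and folding each column's rounding error into the remaining
    # unquantized columns via the inverse Hessian (Cholesky form).
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]
```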
Optimizer
Muon
weight_decay, momentum, and other hyperparameters: not reported
Weight Averaging
EMA
parameters: {"decay":0.997}
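The EMA update is the standard one; a minimal sketch with the reported decay:

```python
def ema_update(avg_weights, new_weights, decay=0.997):
    # Exponential moving average of model weights:
    #   avg <- decay * avg + (1 - decay) * new
    # With decay = 0.997 each step contributes 0.3%, giving an effective
    # averaging window of roughly 1 / (1 - decay) ~= 333 steps.
    return [decay * a + (1.0 - decay) * w
            for a, w in zip(avg_weights, new_weights)]
```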
SWA
parameters: {"type":"Tight SWA"}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
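One common sliding-window convention (the exact context/score split used by the submission is an assumption): each window advances by `stride` tokens and scores only the newly exposed tokens, so every token is evaluated exactly once with up to `seq_len - stride` tokens of left context.

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    # Yield (ctx_start, score_start, end) triples: tokens in
    # [score_start, end) are scored conditioned on [ctx_start, score_start).
    # Every token is scored exactly once; after warm-up, each scored token
    # sees close to seq_len - stride tokens of left context.
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - seq_len)
        spans.append((ctx_start, pos, end))
        pos = end
    return spans
```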
n-gram backoff
parameters: {"order":5,"backoff_orders":[5,4,3,2],"entropy_adaptive":true}
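A minimal sketch of multi-order backoff with a separate hash table per order, as described (table layout and function name are illustrative):

```python
def ngram_backoff_prob(context, token, tables, backoff_orders=(5, 4, 3, 2)):
    # tables[k] maps a (k-1)-token context tuple to a {token: count} dict.
    # Try the highest order first; back off to lower orders whenever the
    # current context has never been seen at that order.
    for k in backoff_orders:
        ctx = tuple(context[-(k - 1):])
        counts = tables.get(k, {}).get(ctx)
        if counts:
            total = sum(counts.values())
            return counts.get(token, 0) / total
    return None  # no order matched; caller uses the model distribution alone
```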
Test-Time Training
score-first TTT
parameters: {"cache_update_after_scoring":true}
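The score-first protocol can be sketched as follows: each token is scored against the cache state *before* that token is inserted, so the evaluation never conditions on the very token it is predicting; the cache update happens only afterwards. The cache class and blending weight here are illustrative assumptions.

```python
class NgramCache:
    # Minimal test-time n-gram cache keyed by a fixed-length context.
    def __init__(self, context_len=2):
        self.context_len = context_len
        self.counts = {}  # context tuple -> {token: count}

    def prob(self, context, token):
        counts = self.counts.get(tuple(context[-self.context_len:]))
        if not counts:
            return None
        return counts.get(token, 0) / sum(counts.values())

    def update(self, context, token):
        ctx = tuple(context[-self.context_len:])
        bucket = self.counts.setdefault(ctx, {})
        bucket[token] = bucket.get(token, 0) + 1


def score_first_step(cache, context, token, model_prob, alpha=0.2):
    # Score first: look up the cache before it has seen this token ...
    cache_p = cache.prob(context, token)
    p = model_prob if cache_p is None else (1 - alpha) * model_prob + alpha * cache_p
    # ... and only then update it, matching cache_update_after_scoring.
    cache.update(context, token)
    return p
```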
Other
Entropy-adaptive alpha blending
Blends model predictions with the n-gram cache using a mixing weight alpha that adapts to the model's predictive entropy.
parameters: {"alpha_low":0.05,"alpha_high":0.4,"entropy_threshold":4}
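A sketch of the entropy-adaptive weight using the reported parameters; a hard threshold is assumed here, though a smooth interpolation between alpha_low and alpha_high would be equally plausible:

```python
import math

def adaptive_alpha(model_probs, alpha_low=0.05, alpha_high=0.4,
                   entropy_threshold=4.0):
    # Weight on the n-gram cache: when the model's predictive entropy is
    # below the threshold (confident model), blend in little of the cache;
    # when entropy is high (uncertain model), lean on the cache more.
    entropy = -sum(p * math.log2(p) for p in model_probs if p > 0.0)
    return alpha_high if entropy >= entropy_threshold else alpha_low

def blend(model_probs, cache_probs, alpha):
    # Convex combination of the two distributions.
    return [(1 - alpha) * m + alpha * c
            for m, c in zip(model_probs, cache_probs)]
```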
Novel Contributions
- LeakyReLU(0.5)² MLP activation
- XSA applied to all 11 layers
- Full Hessian-based GPTQ with actorder and Cholesky error compensation
- 5-gram multi-order backoff with separate hash tables per order
- Entropy-adaptive alpha for n-gram/model mixing
- Score-first n-gram cache protocol