val_bpb: 0.9623
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.87 MB
Training Techniques
Evaluation
- sliding window eval; parameters: {"stride": 64}
- entropy-adaptive cache blending; parameters: {"alpha_formula": "0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
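A minimal sketch of the stride-64 sliding-window evaluation. Only the stride comes from the card; the 512-token window and the `nll_fn` model interface are assumptions.

```python
def sliding_window_mean_nll(tokens, nll_fn, window=512, stride=64):
    """Mean per-token NLL over a long sequence with a stride-64 sliding window.

    Each step scores the next `stride` tokens using up to `window` tokens of
    left context, so every token is evaluated exactly once. `nll_fn(ctx, n)`
    is a placeholder for the model call: it returns the per-token NLLs of the
    last `n` tokens of `ctx`.
    """
    total, count, pos = 0.0, 0, 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx = tokens[max(0, end - window):end]   # at most `window` tokens of context
        n_targets = end - pos                    # tokens scored this step
        total += sum(nll_fn(ctx, n_targets))
        count += n_targets
        pos = end
    return total / count                         # mean NLL per token
```

For a byte-level model, val_bpb is this mean NLL divided by ln 2 when `nll_fn` returns nats.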
Other
- 7-gram entropy-adaptive causal cache: a PPM-style n-gram backoff model blended with the neural model during evaluation; parameters: {"orders": "2-7", "buckets_per_table": 4194304, "min_count": 2, "alpha_base": 0.05, "alpha_range": 0.55, "alpha_scale": 2, "alpha_threshold": 4}
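The cache and its entropy-adaptive blend can be sketched end to end. The orders, min_count, and alpha constants come from the card; the linear-mixture blend form, the entropy unit, and the use of Python dicts in place of flat 4194304-bucket arrays are assumptions.

```python
import math

def cache_alpha(H, base=0.05, rng=0.55, scale=2.0, threshold=4.0):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)): when the model's
    predictive entropy H is low (confident), alpha stays near 0.05; when H
    is high, alpha rises toward 0.60 and the n-gram cache dominates."""
    return base + rng / (1.0 + math.exp(-scale * (H - threshold)))

class NGramCache:
    """Causal hashed n-gram cache, orders 2..7, with highest-order backoff.

    Counts live in fixed-size hash tables (4194304 buckets per table in the
    card; dicts stand in for flat arrays here). A context is usable only if
    its bucket holds at least `min_count` observations; otherwise we back
    off to the next-lower order. Updates are strictly backward-looking: a
    token is counted only after it has been scored.
    """
    def __init__(self, orders=range(2, 8), buckets=4194304, min_count=2):
        self.orders = list(orders)
        self.buckets = buckets
        self.min_count = min_count
        self.tables = {n: {} for n in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def predict(self, history):
        """Token->prob dict from the highest usable order, or None."""
        for n in sorted(self.orders, reverse=True):
            if len(history) < n - 1:
                continue
            counts = self.tables[n].get(self._bucket(tuple(history[-(n - 1):])))
            if counts and sum(counts.values()) >= self.min_count:
                total = sum(counts.values())
                return {t: c / total for t, c in counts.items()}
        return None

    def update(self, history, token):
        """Record `token` for every order's context (called after scoring)."""
        for n in self.orders:
            if len(history) >= n - 1:
                b = self.tables[n].setdefault(
                    self._bucket(tuple(history[-(n - 1):])), {})
                b[token] = b.get(token, 0) + 1

def blend(p_model, p_cache, H):
    """Mix the two distributions with the entropy-adaptive weight."""
    if p_cache is None:
        return p_model
    a = cache_alpha(H)
    return {t: (1 - a) * p_model.get(t, 0.0) + a * p_cache.get(t, 0.0)
            for t in set(p_model) | set(p_cache)}
```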
Architecture
- XSA: XSA-all applied across all 11 layers; parameters: {"layers": 11}
- EBLS: Empirical Bayes Layer Sharing with shared blocks and loops; parameters: {"shared_blocks": 3, "loops": 3}
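The card does not define the Empirical Bayes component of EBLS, but the structural part, 3 shared blocks reapplied over 3 loops, is plain weight tying:

```python
def apply_shared_blocks(x, blocks, loops=3):
    """Layer sharing by looping: the same 3 blocks are reapplied 3 times,
    giving loops * len(blocks) = 9 effective layers while storing weights
    for only len(blocks) = 3. How the Empirical Bayes prior ties or adapts
    the shared weights is not specified in the card; this shows only the
    weight-tied control flow."""
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x
```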
- LoRA: low-rank adaptation; parameters: {"rank": 8}
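A rank-8 LoRA update in the standard form; the alpha scaling constant is an assumption, since the card only gives the rank:

```python
def matvec(W, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def lora_forward(W0, A, B, x, rank=8, alpha=8.0):
    """y = W0 @ x + (alpha / rank) * B @ (A @ x).

    W0 is the frozen base weight; A (rank x d_in) and B (d_out x rank) form
    the trainable rank-8 update. The alpha/rank scaling is the common LoRA
    convention, not something the card states.
    """
    h = matvec(A, x)             # project into the rank-8 bottleneck
    delta = matvec(B, h)         # project back to d_out
    scale = alpha / rank
    return [y + scale * d for y, d in zip(matvec(W0, x), delta)]
```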
- BigramHash: hashed bigram embedding component; parameters: {"vocab": 3072, "dim": 128}
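One plausible reading of the BigramHash component, treating vocab=3072 as the bucket count of the hash table; the pair-mixing function below is illustrative, not the submission's actual hash:

```python
def bigram_bucket(prev_tok, cur_tok, n_buckets=3072):
    """Hash a (prev, cur) token pair into one of n_buckets embedding rows.
    Any stable pair hash works; this multiplier is arbitrary."""
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def bigram_hash_embedding(tokens, table):
    """Per-position bigram features (dim 128 in the card): position i gets
    the table row for the (tokens[i-1], tokens[i]) pair; position 0 has no
    bigram and gets a zero vector. Typically added to the token embedding."""
    dim = len(table[0])
    out = [[0.0] * dim]
    for i in range(1, len(tokens)):
        out.append(table[bigram_bucket(tokens[i - 1], tokens[i], len(table))])
    return out
```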
- RoPE: partial rotary positional embeddings (16 of 64 head dims rotated); parameters: {"dims": "16/64"}
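Partial RoPE in the "16/64" configuration: only the first 16 of 64 per-head dimensions are rotated. The interleaved pair layout and the 10000 frequency base are assumptions:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` entries of a 64-dim head vector;
    the remaining 48 pass through unchanged. Pairs (vec[2i], vec[2i+1])
    are rotated by angle pos * base**(-2i / rot_dims)."""
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```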
- KV head count: grouped-query attention with reduced KV heads; parameters: {"heads": 8, "kv_heads": 4}
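The query-to-KV head mapping implied by heads=8, kv_heads=4:

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    """Grouped-query attention head mapping: 8 query heads share 4 KV heads,
    so each K/V projection serves a group of 8 // 4 = 2 consecutive query
    heads, halving KV-cache size versus full multi-head attention."""
    return q_head // (n_heads // n_kv_heads)
```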
- MLP3x: three-layer MLP with squared LeakyReLU activation; parameters: {"activation": "LeakyReLU(0.5)^2"}
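The activation read literally as act(x) = LeakyReLU_0.5(x)^2; a sign-preserving variant (y * |y|) is also possible but not stated in the card:

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with slope 0.5, then squared. Note the square makes the
    output non-negative; the literal square is used here since the card
    does not say whether the sign is reintroduced."""
    y = x if x >= 0 else slope * x
    return y * y

def mlp3(x, weights):
    """Three linear layers with the squared activation between them (no
    activation after the last layer; that placement is an assumption)."""
    for i, W in enumerate(weights):
        x = [sum(w * v for w, v in zip(row, x)) for row in W]
        if i < len(weights) - 1:
            x = [leaky_relu_sq(v) for v in x]
    return x
```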
Weight Averaging
- EMA + SWA; parameters: {"ema_decay": 0.997, "swa_interval": 50}
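The two averaging schemes, sketched independently; the card gives the decay and the snapshot interval but not how the two averages are combined into the final checkpoint:

```python
def ema_update(ema, weights, decay=0.997):
    """EMA of parameters: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

class SWA:
    """Equal-weight stochastic weight averaging, snapshotting every
    `interval` optimizer steps (swa_interval=50 in the card)."""
    def __init__(self, interval=50):
        self.interval, self.step, self.n, self.mean = interval, 0, 0, None

    def observe(self, weights):
        self.step += 1
        if self.step % self.interval:
            return
        self.n += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Running equal-weight mean over the snapshots seen so far.
            self.mean = [m + (w - m) / self.n for m, w in zip(self.mean, weights)]
```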
Quantization
- GPTQ; bits: 6, scope: all
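Only the int6 grid of the quantizer is sketched here; full GPTQ additionally compensates rounding error column by column using second-order statistics from the calibration data (the validation set, per the card's "val-calibrated" note):

```python
def quantize_int6(row):
    """Symmetric round-to-nearest onto a 6-bit grid for one weight row:
    integers in [-31, 31] with a per-row scale. This is the grid GPTQ
    quantizes onto, without GPTQ's Hessian-aware error propagation."""
    qmax = 31  # 6-bit symmetric: 2**5 - 1
    scale = max(abs(v) for v in row) / qmax or 1.0
    return [round(v / scale) for v in row], scale

def dequantize(q, scale):
    return [scale * v for v in q]
```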
Compression
- lzma; level: 9
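The final packaging step with Python's standard `lzma` module at preset 9; the serialization format of the quantized weight payload itself is not specified in the card:

```python
import lzma

def pack_artifact(payload: bytes) -> bytes:
    """Compress the serialized (GPTQ int6-quantized) weights with LZMA at
    preset 9, matching the card's level: 9."""
    return lzma.compress(payload, preset=9)

def unpack_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```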
Novel Contributions
- 7-gram causal entropy-adaptive n-gram cache blended with neural predictions
- Strictly backward-looking cache updates with no oracle/min(NLL) selection
- Entropy-based adaptive alpha that increases cache weight when model entropy is high
- EBLS layer sharing with 3 shared blocks and 3 loops
- XSA-all across all 11 layers
- Val-calibrated GPTQ int6 quantization combined with LZMA compression