- **val_bpb:** 0.3461
- **Architecture:** Transformer
- **Optimizer:** Muon
- **Artifact size:** 15.3-15.6 MB
## Training Techniques

### Architecture
**GQA.** Grouped-query attention with 8 query heads sharing 4 KV heads. Parameters: `{"heads": 8, "kv_heads": 4}`.
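A minimal sketch of the grouped-query attention layout described above, in pure Python for clarity: 8 query heads share 4 KV heads, so each pair of query heads reads the same cached K/V. Head dimension and the single-position framing are illustrative, not from the report.

```python
import math

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention for one query position.

    q: [n_heads][d]            query vectors for the current position
    k, v: [seq][n_kv_heads][d] cached keys/values
    Each group of n_heads // n_kv_heads query heads shares one KV head.
    """
    group = n_heads // n_kv_heads
    d = len(q[0])
    out = []
    for h in range(n_heads):
        kvh = h // group  # which shared KV head this query head uses
        scores = [sum(a * b for a, b in zip(q[h], k[t][kvh])) / math.sqrt(d)
                  for t in range(len(k))]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(weights[t] * v[t][kvh][j] for t in range(len(v)))
                    for j in range(d)])
    return out
```

With 4 KV heads instead of 8, the KV cache is half the size of standard multi-head attention, which matters for a small-artifact submission like this one.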
**LeakyReLU.** LeakyReLU activation in the MLP with negative slope 0.5. Parameters: `{"slope": 0.5}`.
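For concreteness, the activation with the report's slope of 0.5 (much larger than the conventional 0.01):

```python
def leaky_relu(x, slope=0.5):
    # LeakyReLU: identity for non-negative inputs, scaled by `slope`
    # (0.5 per the report) for negative inputs.
    return x if x >= 0.0 else slope * x
```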
**Partial RoPE.** Rotary positional embeddings applied to only part of each head's dimensions. Parameters: none recorded.
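Since no parameters were recorded for the partial RoPE, the sketch below assumes the common formulation: rotary embedding is applied to the first `rot_dim` dimensions of each head vector and the rest pass through unrotated. The base of 10000 is the usual default, not a value from the report.

```python
import math

def partial_rope(x, pos, rot_dim, base=10000.0):
    """Apply rotary position embedding to the first `rot_dim` dimensions
    of head vector `x` at position `pos`; remaining dims are unchanged."""
    out = list(x)
    for i in range(0, rot_dim, 2):
        theta = pos * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s      # 2D rotation of the pair
        out[i + 1] = x[i] * s + x[i + 1] * c  # (x[i], x[i+1])
    return out
```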
**XSA.** Applied in the last 4 layers. Parameters: `{"layers": 4}`.
**Value Residual.** Value residual connections in the model. Parameters: none recorded.
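No parameters are recorded for the value residual, so the sketch below assumes the usual formulation: each later layer's attention value vectors are mixed with the first layer's. The mixing weight `lam` is illustrative, not a value from the report.

```python
def value_residual(v_layer, v_first, lam=0.5):
    # Value residual connection: blend the current layer's value vectors
    # with the first layer's. `lam` is an illustrative mixing weight.
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]
```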
### Regularization

**LN scale.** LayerNorm scale regularization. Parameters: none recorded.
### Quantization

**Mixed int5/int6.** MLP weights quantized to int5, attention weights to int6.
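A sketch of symmetric per-tensor quantization at the two bit widths used here (int5 for MLP weights, int6 for attention). The exact scheme, packing, and per-channel details aren't recorded, so this shows only the generic idea; the quantized streams would then be compressed with zstd (next section).

```python
def quantize(weights, bits):
    """Symmetric linear quantization to signed `bits`-bit integers.
    Returns integer codes and the scale needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]
```

Rounding error per weight is bounded by half the scale, and int6 halves that error relative to int5 at a cost of one extra bit per weight, which motivates spending the extra bit on the attention weights.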
### Compression

**zstd** at compression level 22.
### Weight Averaging

**EMA.** Exponential moving average of weights. Parameters: `{"decay": 0.997}`.
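The EMA update with the recorded decay of 0.997 is one line per parameter; evaluation then uses the averaged weights rather than the raw training weights:

```python
def ema_update(avg, params, decay=0.997):
    # Exponential moving average: avg <- decay * avg + (1 - decay) * params.
    # With decay 0.997 the average tracks roughly the last ~300 steps.
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```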
### Optimizer

**Muon.** Learning rate 0.03; weight decay and momentum not recorded.
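Only the learning rate (0.03) is recorded. For context, Muon's core step approximately orthogonalizes the momentum matrix with a quintic Newton-Schulz iteration before applying it as the update. The sketch below uses the commonly published coefficients; treat it as a simplified illustration, not this submission's exact configuration. Note the iteration pushes singular values toward 1 only approximately, not exactly onto 1.

```python
import math

def _matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def _lin_comb(terms):
    # Elementwise sum of coeff * matrix over [(coeff, matrix), ...].
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(cols)]
            for i in range(rows)]

def newton_schulz(g, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Approximately orthogonalize `g` via the quintic Newton-Schulz
    iteration used by Muon: X <- aX + b(XX^T)X + c(XX^T)^2 X."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]  # normalize into the
    for _ in range(steps):                      # iteration's basin
        xxt = _matmul(x, [list(col) for col in zip(*x)])
        x2 = _matmul(xxt, x)    # (X X^T) X
        x3 = _matmul(xxt, x2)   # (X X^T)^2 X
        x = _lin_comb([(a, x), (b, x2), (c, x3)])
    return x
```

The actual Muon update would then be `w <- w - lr * newton_schulz(momentum)` with `lr = 0.03` per the report.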
### Evaluation

**Sliding-window eval.** Parameters: none recorded.

**Stride-based eval.** Parameters: `{"stride": 64}`.
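A sketch of stride-based sliding-window scoring with the recorded stride of 64: windows advance by the stride, and each window only contributes the tokens not already covered by earlier windows, so every token is scored exactly once under window-limited context. `score_fn` and the window size are illustrative; the report records only the stride.

```python
def strided_nll(score_fn, tokens, window, stride=64):
    """Average per-token NLL. `score_fn(seq)` returns one NLL per
    position of `seq`, each conditioned on the preceding in-window
    tokens. Assumes window >= stride."""
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        per_token = score_fn(tokens[begin:end])
        new = end - prev_end          # tokens no earlier window scored
        total += sum(per_token[-new:])
        count += new
        prev_end = end
        if end == len(tokens):
            break
    return total / count
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes per token.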
### Test-Time Training

**LoRA TTT.** Per-document LoRA adaptation at test time. Parameters: `{"rank": 8}`.
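In LoRA test-time training, only a rank-8 adapter pair (A, B) is updated on the test document while the base weights stay frozen. The forward-pass sketch below shows the adapter structure; the scaling factor `alpha` and the zero initialization of B (which makes the adapter a no-op before training) are standard LoRA conventions, not values from the report.

```python
def _matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def lora_forward(x, w, a, b, alpha=16.0, rank=8):
    """y = W x + (alpha / rank) * B (A x).

    w: frozen base weight [d_out x d_in]
    a: adapter down-projection [rank x d_in]  (trained at test time)
    b: adapter up-projection [d_out x rank]   (trained, zero-initialized)
    """
    base = _matvec(w, x)
    delta = _matvec(b, _matvec(a, x))
    s = alpha / rank
    return [y + s * d for y, d in zip(base, delta)]
```

Because B starts at zero, per-document adaptation begins from exactly the base model and only the 2 * rank * d adapter parameters per layer need gradients, keeping TTT cheap.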
### Other

**PPM-style all-order blend.** Blends all matching n-gram orders from 2 to 12 using escape probabilities, with leave-one-out self-exclusion during full rescore. Parameters: `{"orders": [2, 12]}`.
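A minimal sketch of the blend: n-gram counts are kept for orders 2-12, and prediction walks from the longest matching context down, giving each order its count-based probability mass and passing the escape mass to the next-shorter order, ending in a uniform fallback. The report doesn't specify the escape scheme, so a PPM-C-style escape (`distinct / (total + distinct)`) is assumed here. The `exclude_self` flag sketches the leave-one-out rescore: when scoring the token actually present at a position against counts built over the full document, its own occurrence is first subtracted at every order.

```python
def build_counts(tokens, min_order=2, max_order=12):
    """counts[order][context_tuple][symbol] -> occurrence count."""
    counts = {o: {} for o in range(min_order, max_order + 1)}
    for o in counts:
        for i in range(o, len(tokens)):
            dist = counts[o].setdefault(tuple(tokens[i - o:i]), {})
            dist[tokens[i]] = dist.get(tokens[i], 0) + 1
    return counts

def ppm_prob(counts, context, symbol, vocab_size,
             min_order=2, max_order=12, exclude_self=False):
    p, escape_mass = 0.0, 1.0
    for order in range(max_order, min_order - 1, -1):
        if len(context) < order:
            continue
        dist = counts[order].get(tuple(context[-order:]))
        if not dist:
            continue  # no match at this order; mass passes down intact
        total = sum(dist.values())
        distinct = len(dist)
        c = dist.get(symbol, 0)
        if exclude_self and c > 0:
            # Leave-one-out: drop this position's own occurrence so the
            # full-document cache doesn't score the token against itself.
            c -= 1
            total -= 1
            if c == 0:
                distinct -= 1
        if total == 0:
            continue
        denom = total + distinct
        p += escape_mass * (c / denom)
        escape_mass *= distinct / denom   # PPM-C escape (assumed)
    return p + escape_mass / vocab_size   # uniform fallback
```

The escape construction telescopes, so the blended probabilities over the vocabulary sum to exactly one.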
## Novel Contributions
- PPM-style all-order blend across matching n-gram orders 2-12 using escape probabilities
- Leave-one-out self-exclusion in full-rescore to remove self-inclusion bias
- Two-pass evaluation pipeline with GPU sliding-window scoring, cache build, and full-token rescore
- Mixed int5/int6 quantization with zstd compression
- Neural cache and per-document LoRA test-time training described in the branch README