val_bpb: 1.0803
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15,977,914 bytes
Training Techniques

Test-Time Training: score-first TTT
parameters: {"enabled":true,"learning_rate":0.002,"epochs":3}
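The score-first discipline can be sketched as follows. The loss below is a toy quadratic stand-in for the real model's objective (`loss_and_grad` is hypothetical); the learning rate and epoch count follow the parameters above. The point is the ordering: each chunk is scored before the weights are updated on it, so no token is ever predicted by a model that has already trained on it.

```python
import numpy as np

def loss_and_grad(w, chunk):
    # Toy objective: squared distance from the chunk mean (illustration only;
    # the real artifact uses the neural LM's loss here).
    target = np.mean(chunk)
    diff = w - target
    return float(np.sum(diff ** 2)), 2.0 * diff

def score_first_ttt(w0, chunks, lr=0.002, epochs=3):
    """Score-first test-time training (sketch)."""
    w = np.asarray(w0, dtype=float).copy()
    losses = []
    for chunk in chunks:
        loss, _ = loss_and_grad(w, chunk)   # 1) score with current weights
        losses.append(loss)
        for _ in range(epochs):             # 2) only then adapt on the chunk
            _, g = loss_and_grad(w, chunk)
            w -= lr * g
    return losses, w
```

Because adaptation happens strictly after scoring, the second of two identical chunks is scored by a model that has already adapted to the first, and only to the first.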
Architecture

BigramHash: hash-bucketed causal PPM experts using rolling-hash context tables for global and document-local token-context modeling.
parameters: {"global_order":6,"local_order":8,"global_buckets":2048,"local_buckets":2048}
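A sketch of one such hash-bucketed causal count table, assuming byte-level tokens, 64-bit FNV-1a as the rolling hash, and add-one smoothing (the artifact's exact smoothing and escape handling are not specified; bucket collisions are accepted by design):

```python
import numpy as np

FNV_OFFSET, FNV_PRIME = 0xcbf29ce484222325, 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a(tokens):
    """64-bit FNV-1a hash of a token context (one byte per token assumed)."""
    h = FNV_OFFSET
    for t in tokens:
        h = ((h ^ (t & 0xff)) * FNV_PRIME) & MASK64
    return h

class HashPPM:
    """Hash-bucketed causal count model: one sketch of a PPM expert."""
    def __init__(self, order=6, buckets=2048, vocab=256):
        self.order, self.buckets = order, buckets
        self.counts = np.zeros((buckets, vocab), dtype=np.int64)

    def bucket(self, context):
        return fnv1a(context[-self.order:]) % self.buckets

    def prob(self, context, token):
        row = self.counts[self.bucket(context)]
        return (row[token] + 1) / (row.sum() + len(row))  # add-one smoothing

    def update(self, context, token):
        self.counts[self.bucket(context), token] += 1

    def score(self, tokens):
        """Causal bits-per-token: each token is scored, then counted."""
        bits = 0.0
        for i, t in enumerate(tokens):
            ctx = tokens[max(0, i - self.order):i]
            bits += -np.log2(self.prob(ctx, t))
            self.update(ctx, t)
        return bits / len(tokens)
```

A "global" expert keeps its counts across documents, while a "document-local" expert (higher order, per the parameters) resets them at each document boundary.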
Partial RoPE: rotary positional embedding applied to only a subset of the head dimensions (16 of 64).
parameters: {"dimensions":16,"base_dimensions":64}
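A minimal sketch of partial RoPE: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The pairing convention (first half of the rotated slice paired with the second half) and the frequency schedule are assumptions, not documented details of the artifact.

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` of the head dimension by position-dependent
    angles; leave the remaining dims untouched (assumed pairing/frequencies)."""
    x = np.asarray(x, dtype=float).copy()
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # assumed frequency schedule
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1 = x[..., :half].copy()
    x2 = x[..., half:rot_dims].copy()
    x[..., :half] = x1 * cos - x2 * sin         # 2-D rotation per frequency pair
    x[..., half:rot_dims] = x1 * sin + x2 * cos
    return x
```

Since each pair undergoes a plain 2-D rotation, the norm of the rotated slice is preserved and position 0 is the identity.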
XSA: cross/self-attention-style architectural modification used in the base stack.
parameters: {"layers":4}
VE128: value-embedding / value-residual-style component used in later layers.
parameters: {"layers":[9,10]}
LeakyReLU: squared-LeakyReLU MLP activation variant in the base model.
parameters: {"variant":"squared"}
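The "squared" variant can be sketched as below. The sign-preserving squaring convention is an assumption; the base model may instead simply square the LeakyReLU output.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.01):
    """Squared-LeakyReLU sketch: LeakyReLU followed by a sign-preserving
    square, so positives grow as x**2 while negatives stay slightly negative.
    (The exact convention in the base model is assumed, not documented.)"""
    y = np.where(x > 0, x, slope * x)
    return np.sign(y) * y ** 2
```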
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
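One plausible reading of "EMA + Tight SWA" is sketched below: an exponential moving average is maintained every step, and EMA snapshots taken every `swa_interval` steps are averaged. How the artifact actually combines the two averages is an assumption.

```python
import numpy as np

def ema_swa(weight_iterates, ema_decay=0.997, swa_interval=50):
    """EMA plus tight SWA (sketch): EMA every step, SWA over EMA snapshots
    taken at a fixed interval (combination scheme assumed)."""
    ema = None
    snapshots = []
    for step, w in enumerate(weight_iterates, start=1):
        w = np.asarray(w, dtype=float)
        ema = w.copy() if ema is None else ema_decay * ema + (1 - ema_decay) * w
        if step % swa_interval == 0:
            snapshots.append(ema.copy())
    return np.mean(snapshots, axis=0) if snapshots else ema
```

With a decay of 0.997 the EMA tracks roughly the last few hundred steps, and the short snapshot interval keeps the SWA "tight", i.e. averaged over late, nearby iterates rather than the whole run.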
Quantization
GPTQ-lite
bits: 6
scope: model weights
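For scale, a plain round-to-nearest 6-bit per-row weight quantizer looks like the sketch below. This is the baseline only; "GPTQ-lite" presumably also applies a GPTQ-style error-correction pass whose details are not given here.

```python
import numpy as np

def quantize_rtn_6bit(w):
    """Per-row symmetric round-to-nearest 6-bit quantization (sketch)."""
    w = np.asarray(w, dtype=float)
    qmax = 2 ** (6 - 1) - 1                      # signed 6-bit range: [-32, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(float) * scale
```

Round-to-nearest bounds the per-weight error by half a quantization step, which is what an error-correcting pass like GPTQ then improves on in aggregate.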
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
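A sketch of the layerwise LN scale, assuming 0-indexed layers and that the 1/sqrt(layer+1) factor multiplies the LayerNorm output, so deeper layers inject progressively smaller updates into the residual stream:

```python
import numpy as np

def scaled_layernorm(x, layer_index, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer_index + 1)
    (sketch; 0-indexed layers and the placement of the factor assumed)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mu) / np.sqrt(var + eps)
    return y / np.sqrt(layer_index + 1)
```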
Other

Fixed-share Bayesian mixer blending neural, global PPM, and local PPM experts with deterministic chunk-wise posterior updates.
parameters: {"share":0.005,"prior":[0.9,0.07,0.03]}
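A sketch of the fixed-share update over the three experts. The `chunk_probs` interface (each expert's average per-token probability on each chunk) is an assumption about how the experts are summarized; note also that classic fixed-share redistributes weight toward the uniform distribution, whereas this sketch redistributes toward the stated prior.

```python
import numpy as np

def fixed_share_mix(chunk_probs, share=0.005, prior=(0.9, 0.07, 0.03)):
    """Fixed-share Bayesian mixture over experts (sketch).

    chunk_probs: array [n_chunks, n_experts] of each expert's probability
    on each chunk. Weights update once per chunk, AFTER the chunk is
    scored, so the mixture stays causal and deterministic."""
    w = np.asarray(prior, dtype=float)
    mixed = []
    for p in np.asarray(chunk_probs, dtype=float):
        mixed.append(float(w @ p))            # 1) score chunk with current weights
        post = w * p                          # 2) Bayesian posterior update
        post /= post.sum()
        w = (1 - share) * post + share * np.asarray(prior)  # 3) fixed share toward prior
    return np.array(mixed), w
```

The share term keeps every expert's weight bounded away from zero, so the mixer can recover quickly when the best expert changes, e.g. at a document boundary where the local PPM resets.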
GPU-vectorized causal scoring with FNV rolling hashes, hash-bucketed count tables, and a prefix-rank counter for chunk-local scoring.
parameters: null
Novel Contributions
- DualClock mixture of neural, global PPM, and document-local PPM experts
- Fixed-share Bayesian mixer with deterministic chunk-wise weight updates
- GPU-vectorized causal PPM scoring using FNV rolling hashes and hash-bucketed count tables
- Prefix-rank counter for parallel chunk-local causal scoring without leakage
- Score-before-update legality for both TTT and mixture updates
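The prefix-rank idea behind the last two bullets can be sketched as follows. This is a NumPy stand-in for the GPU kernel; `keys` is assumed to be one integer per position encoding the (context-bucket, next-token) pair. Adding each position's prefix rank to the pre-chunk global count reproduces the exact sequential counts even when the whole chunk is scored in parallel, with no within-chunk leakage.

```python
import numpy as np

def prefix_rank(keys):
    """For each position i, count the positions j < i in the same chunk
    with keys[j] == keys[i] (vectorized sketch of the prefix-rank counter)."""
    keys = np.asarray(keys)
    order = np.argsort(keys, kind="stable")   # group equal keys, preserving position order
    sorted_keys = keys[order]
    # Mark where each run of equal keys begins, then rank within each run.
    boundary = np.r_[True, sorted_keys[1:] != sorted_keys[:-1]]
    group_start = np.maximum.accumulate(np.where(boundary, np.arange(len(keys)), 0))
    ranks_sorted = np.arange(len(keys)) - group_start
    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted               # scatter ranks back to original positions
    return ranks
```

For example, `prefix_rank([5, 3, 5, 5, 3])` yields `[0, 0, 1, 2, 1]`: the third `5` has rank 2 because two earlier positions in the chunk share its key.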