val_bpb: 1.0945
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB
Training Techniques
Architecture
BigramHash
Uses a bigram hash component in the model stack.
parameters: {"size":1536}
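The bigram-hash component can be sketched as a hashed embedding table: each (previous, current) byte pair hashes into one of 1536 rows, and the looked-up vector is added to the ordinary token embedding. Only the table size comes from the card; the hash function, the model width `D_MODEL`, and all names below are illustrative assumptions.

```python
import numpy as np

TABLE_SIZE = 1536   # from parameters {"size": 1536}
D_MODEL = 64        # assumed model width for this sketch

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, D_MODEL)) * 0.02

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash into the table; any stable hash works.
    return (prev_tok * 31 + cur_tok) % TABLE_SIZE

def bigram_features(tokens: list) -> np.ndarray:
    # Position 0 has no predecessor; use 0 as a padding token.
    prev = [0] + tokens[:-1]
    idx = [bigram_hash(p, c) for p, c in zip(prev, tokens)]
    return bigram_table[idx]  # (seq_len, D_MODEL), added to token embeddings

feats = bigram_features([72, 101, 108, 108, 111])  # "Hello" as bytes
```

Hashing collisions are accepted by design: with 1536 rows and 65536 possible byte pairs, many bigrams share a row, which keeps the component tiny.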
XSA
Applies XSA to the last 4 layers of the model.
parameters: {"layers":4}
RoPE
Uses partial rotary positional embeddings, rotating 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
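Partial RoPE can be sketched as rotating only the first 16 of the 64 head dimensions and passing the remaining 48 through unchanged. The split matches the card's parameters; the base frequency 10000 and the dimension pairing are standard-RoPE assumptions, not taken from the card.

```python
import numpy as np

ROT_DIMS, HEAD_DIM = 16, 64  # {"dimensions": 16, "total_dimensions": 64}

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    # x: (seq, HEAD_DIM); pos: (seq,) integer positions
    half = ROT_DIMS // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # (half,)
    angles = pos[:, None] * freqs[None, :]             # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dimensions beyond ROT_DIMS carry no positional signal.
    return np.concatenate([rotated, x[:, ROT_DIMS:]], axis=-1)

x = np.random.default_rng(0).standard_normal((8, HEAD_DIM))
out = partial_rope(x, np.arange(8))
```

Rotating only a prefix of the head dimensions is a common trick to keep some channels position-free while still encoding relative positions.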
VE128
Adds value residual enhancement in layers 9 and 10.
parameters: {"layers":[9,10],"dimension":128}
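The card gives only the name, layers, and dimension for VE128, so this is a speculative sketch assuming a ResFormer-style value residual: in layers 9 and 10, the attention values are mixed with the first layer's values. The mixing weight `lam` is illustrative, not from the card.

```python
import numpy as np

VALUE_DIM = 128            # from parameters {"dimension": 128}
ENHANCED_LAYERS = {9, 10}  # from parameters {"layers": [9, 10]}

def mix_values(v_layer: np.ndarray, v_first: np.ndarray,
               layer_idx: int, lam: float = 0.5) -> np.ndarray:
    # v_layer, v_first: (seq, VALUE_DIM); blend only in the listed layers.
    if layer_idx in ENHANCED_LAYERS:
        return lam * v_layer + (1.0 - lam) * v_first
    return v_layer
```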
MLP3x
Uses a 3x MLP stack.
parameters: null
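A minimal sketch of the MLP block, assuming "3x" refers to a 3x hidden expansion (hidden width = 3 * d_model); the card does not spell this out, and the model width here is illustrative.

```python
import numpy as np

D_MODEL = 64
HIDDEN = 3 * D_MODEL  # "3x" assumed to mean a 3x hidden expansion

rng = np.random.default_rng(0)
w_in = rng.standard_normal((D_MODEL, HIDDEN)) * 0.02
w_out = rng.standard_normal((HIDDEN, D_MODEL)) * 0.02

def mlp(x: np.ndarray) -> np.ndarray:
    # Plain ReLU as a placeholder; the card's actual activation is a
    # squared LeakyReLU (see the activation entry below).
    h = np.maximum(x @ w_in, 0.0)
    return h @ w_out

y = mlp(np.ones((2, D_MODEL)))
```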
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"squared":true,"negative_slope":0.5}
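A sketch of the activation, assuming "squared" means the LeakyReLU output is squared elementwise, by analogy with the ReLU-squared activation; the card does not define the exact form.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope: float = 0.5):
    # LeakyReLU with slope 0.5 on the negative side, then squared.
    # Note: squaring maps negative branch outputs to positive values.
    x = np.asarray(x, dtype=float)
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```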
Weight Averaging
EMA + Tight SWA
parameters: {"ema_decay":0.997,"swa_every":50}
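The weight-averaging scheme can be sketched as two running averages over training: an EMA with decay 0.997 updated every step, and a "tight" SWA that folds in a snapshot every 50 steps. Both constants come from the card; how the two averages are combined at the end is an assumption left open here.

```python
import numpy as np

EMA_DECAY, SWA_EVERY = 0.997, 50  # {"ema_decay": 0.997, "swa_every": 50}

def run_averaging(weights_per_step):
    # weights_per_step: sequence of parameter arrays, one per step.
    ema = weights_per_step[0].astype(float).copy()
    swa_sum = np.zeros_like(ema)
    swa_count = 0
    for step, w in enumerate(weights_per_step):
        ema = EMA_DECAY * ema + (1.0 - EMA_DECAY) * w
        if step % SWA_EVERY == 0:  # "tight" = frequent snapshots
            swa_sum += w
            swa_count += 1
    return ema, swa_sum / swa_count
```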
Quantization
GPTQ-lite
bits: 6
scope: model
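"GPTQ-lite" is not described further in the card, so this sketch shows only the basic mechanics of a 6-bit grid (64 levels) with a per-row scale, using simple round-to-nearest rather than GPTQ's error-compensating updates.

```python
import numpy as np

BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31 for symmetric signed quantization

def quantize_rows(w: np.ndarray):
    # One scale per row, chosen so the row's max maps to +/- QMAX.
    scale = np.abs(w).max(axis=1, keepdims=True) / QMAX
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -QMAX - 1, QMAX).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Six bits per weight (before entropy coding) is the main lever behind the sub-16 MB artifact size.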
Compression
lzma
level: null
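The final artifact step can be sketched with the standard-library `lzma` module: the quantized weight bytes are LZMA-compressed for storage and decompressed at load time. The helper names and the serialization format are illustrative.

```python
import lzma
import numpy as np

def compress_weights(q: np.ndarray) -> bytes:
    # level: null in the card; lzma.compress uses its default preset here.
    return lzma.compress(q.tobytes())

def decompress_weights(blob: bytes, dtype, shape) -> np.ndarray:
    return np.frombuffer(lzma.decompress(blob), dtype=dtype).reshape(shape)
```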
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adam_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
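The momentum warmup implied by other_params can be sketched as a ramp from 0.92 to the final 0.99 over the first 1500 steps; the linear shape of the ramp is an assumption, only the endpoints and step count come from the card.

```python
# From other_params: momentum_warmup_start 0.92, momentum 0.99,
# momentum_warmup_steps 1500.
START, END, WARMUP_STEPS = 0.92, 0.99, 1500

def momentum_at(step: int) -> float:
    # Linear ramp (assumed), then constant at the final momentum.
    if step >= WARMUP_STEPS:
        return END
    frac = step / WARMUP_STEPS
    return START + frac * (END - START)
```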
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
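The LN-scale regularizer can be sketched as a depth-dependent damping: a layer's normalized output is multiplied by 1/sqrt(layer + 1), so deeper layers contribute smaller updates to the residual stream. Whether the scale applies to the LayerNorm output or the whole residual branch is not specified in the card.

```python
import numpy as np

def ln_scale(x: np.ndarray, layer: int) -> np.ndarray:
    # {"scale": "1/sqrt(layer+1)"}; layer is 0-indexed here (assumed).
    return x / np.sqrt(layer + 1)
```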
Evaluation
sliding window eval
parameters: {"stride":64}
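Sliding-window evaluation with stride 64 can be sketched as scoring overlapping windows where only the last 64 positions of each window count toward the loss, so every token is evaluated with the longest available left context. The window length and `score_fn` (a stand-in for the model forward pass returning per-token NLLs) are assumptions.

```python
import numpy as np

STRIDE, WINDOW = 64, 256  # stride from the card; WINDOW is assumed

def sliding_window_nll(tokens, score_fn) -> float:
    total, count = 0.0, 0
    start = 0
    while start < len(tokens):
        end = min(start + STRIDE, len(tokens))
        ctx_start = max(0, end - WINDOW)
        nlls = score_fn(tokens[ctx_start:end])  # per-token NLLs
        fresh = end - start                     # only new tokens count
        total += float(np.sum(nlls[-fresh:]))
        count += fresh
        start = end
    return total / count
```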
Other
N-gram cache with entropy-adaptive alpha interpolates byte-level N-gram predictions with model logits during evaluation.
parameters: {"max_order":7,"alpha":0.5,"nll_threshold":2.5,"adaptive_range":[0.1,2],"backoff":"strict"}
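The N-gram cache can be sketched as follows, with several stated assumptions: counts are collected online over the bytes seen so far; "strict backoff" means the longest matching context from order 7 down to order 2 is used and lower orders are consulted only when higher ones have no counts; and the interpolation weight is the base alpha scaled by the model's own NLL relative to nll_threshold, clipped to adaptive_range.

```python
import numpy as np
from collections import defaultdict

MAX_ORDER, BASE_ALPHA = 7, 0.5
NLL_THRESHOLD = 2.5
ALPHA_MIN, ALPHA_MAX = 0.1, 2.0  # "adaptive_range": [0.1, 2]

class NgramCache:
    def __init__(self):
        # context tuple (of bytes) -> {next byte: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, ctx: bytes, nxt: int):
        # Record the new byte under every context length up to order-1.
        for n in range(1, MAX_ORDER):
            if len(ctx) >= n:
                self.counts[tuple(ctx[-n:])][nxt] += 1

    def predict(self, ctx: bytes):
        # Strict backoff: the longest context with any counts wins.
        for n in range(MAX_ORDER - 1, 0, -1):
            if len(ctx) < n:
                continue
            c = self.counts.get(tuple(ctx[-n:]))
            if c:
                total = sum(c.values())
                probs = np.zeros(256)
                for b, k in c.items():
                    probs[b] = k / total
                return probs
        return None  # no match at any order: cache abstains

def adaptive_alpha(model_nll: float) -> float:
    # Lean harder on the cache when the model is uncertain.
    scale = float(np.clip(model_nll / NLL_THRESHOLD, ALPHA_MIN, ALPHA_MAX))
    return min(1.0, BASE_ALPHA * scale)

def mix(model_probs, cache_probs, model_nll):
    if cache_probs is None:
        return model_probs
    a = adaptive_alpha(model_nll)
    return (1.0 - a) * model_probs + a * cache_probs
```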
Test-Time Training
TTT
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
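A minimal sketch of a warmdown schedule: the learning rate is held at its base value and then decays to zero over the final 3500 steps. Linear decay is an assumption; only warmdown_steps is given in the card.

```python
WARMDOWN_STEPS = 3500  # from parameters {"warmdown_steps": 3500}

def lr_at(step: int, total_steps: int, base_lr: float) -> float:
    remaining = total_steps - step
    if remaining >= WARMDOWN_STEPS:
        return base_lr
    return base_lr * remaining / WARMDOWN_STEPS  # linear warmdown (assumed)
```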
Novel Contributions
- N-gram cache replaces TTT for evaluation-time adaptation
- Entropy-adaptive alpha scales cache interpolation by token uncertainty
- Strict backoff N-gram cache with order 7 to 2
- CPU-overlapped N-gram scoring alongside GPU sliding window evaluation
- Achieves 1.0945 BPB with 3-seed consistency and sub-16 MB artifacts