val_bpb: 0.3922
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.94 MB
Training Techniques
- Quantization: QAT (bits: 6, scope: all)
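The QAT entry can be read as fake quantization in the forward pass: weights are snapped to a 6-bit grid while the optimizer still updates full-precision copies. A minimal numpy sketch; the symmetric per-tensor scaling is an assumption, since the card does not specify the scheme.

```python
import numpy as np

def fake_quantize(w, bits=6):
    # Symmetric per-tensor fake quantization: round weights onto a
    # 6-bit grid (64 levels) in the forward pass; training updates the
    # underlying float weights (straight-through estimator).
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                            # dequantized values

w = np.random.default_rng(0).standard_normal((4, 4))
wq = fake_quantize(w, bits=6)
```

Because the grid has at most 64 levels, the rounding error per weight is bounded by half a quantization step.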
- Compression: lzma (level: null)
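The artifact compression is a straightforward use of Python's standard `lzma` module; quantized weights have low entropy, so they compress well. A round-trip sketch with illustrative shapes:

```python
import lzma
import numpy as np

# Serialize quantized weights to bytes, then LZMA-compress them.
rng = np.random.default_rng(0)
weights = rng.integers(-32, 32, size=(256, 64), dtype=np.int8)  # 6-bit value range
blob = lzma.compress(weights.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(256, 64)
```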
- Optimizer: Muon (weight_decay: 0.04, momentum: null, matrix_lr: 0.025)
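Muon applies momentum to the gradient of each 2-D weight matrix and then approximately orthogonalizes the update before applying it with the matrix learning rate. A rough numpy sketch using the textbook cubic Newton-Schulz iteration; the reference Muon implementation uses a tuned quintic, and the momentum coefficient below is an assumption since the card leaves it null.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    # Maps g toward the nearest (semi-)orthogonal matrix. Normalizing by
    # the Frobenius norm puts all singular values in (0, 1], where the
    # cubic iteration x <- 1.5x - 0.5 x x^T x pushes them toward 1.
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, g, momentum_buf, lr=0.025, beta=0.95, weight_decay=0.04):
    # Momentum on the raw gradient, orthogonalize the buffered update,
    # then apply with matrix_lr and decoupled weight decay (card values:
    # lr=0.025, weight_decay=0.04; beta is an assumed placeholder).
    momentum_buf = beta * momentum_buf + g
    update = newton_schulz_orthogonalize(momentum_buf)
    w = w * (1 - lr * weight_decay) - lr * update
    return w, momentum_buf
```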
- Weight Averaging: EMA (decay: 0.997)
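EMA keeps a shadow copy of the weights that trails training with decay 0.997; the shadow copy, not the raw weights, is what gets evaluated and exported. A minimal sketch:

```python
def ema_update(shadow, params, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * params, elementwise.
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

shadow = [0.0, 1.0]
shadow = ema_update(shadow, [1.0, 1.0])
```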
- Architecture:
  - weight tying: tied input and output embeddings
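Weight tying reuses the input embedding matrix as the output projection, which matters for artifact size because the vocabulary embedding is a large share of a small model's parameters. A sketch with hypothetical sizes:

```python
import numpy as np

class TiedLM:
    # The input embedding and the output (logit) projection share one
    # matrix, saving vocab_size * d_model parameters.
    def __init__(self, vocab_size=256, d_model=32, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.standard_normal((vocab_size, d_model)) * 0.02

    def embed_tokens(self, ids):
        return self.embed[ids]            # (seq, d_model)

    def logits(self, hidden):
        return hidden @ self.embed.T      # (seq, vocab_size), same matrix
```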
  - U-Net skip connections: U-Net-style skip connections in the model
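The U-Net pattern saves activations from the first half of the layer stack and adds them back at the mirrored depth in the second half. A minimal sketch of the wiring (additive skips are an assumption; the card does not specify how the branches are merged):

```python
def unet_transformer_pass(x, layers):
    # First half of the stack: push activations. Second half: pop the
    # mirrored activation and add it before running the layer.
    n = len(layers)
    stack = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            stack.append(x)
        elif stack:
            x = x + stack.pop()
        x = layer(x)
    return x
```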
  - SmearGate: per-dimension gate blending each token with the previous token
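As described, SmearGate learns one gate per channel that mixes the current token's vector with the previous token's. A numpy sketch; treating the gate as a learned value in [0, 1] and passing the first token through unchanged are assumptions:

```python
import numpy as np

def smear_gate(x, gate):
    # x: (seq, d); gate: (d,) with entries in [0, 1].
    # Per dimension: y_t = (1 - g) * x_t + g * x_{t-1};
    # the first token has no predecessor and is left unchanged.
    prev = np.concatenate([x[:1], x[:-1]], axis=0)
    return (1.0 - gate) * x + gate * prev
```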
  - BigramHash: hash-table embedding for token bigrams (dimensions: 2048x128)
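BigramHash gives the model direct access to bigram information without a full vocab-squared table: each (previous, current) token pair is hashed into a fixed table of 2048 rows of dimension 128, per the card. The hash function and the zero-padding of the first position are illustrative assumptions:

```python
import numpy as np

class BigramHash:
    # Hashed embedding for token bigrams: collisions share a row, trading
    # accuracy for a fixed 2048 x 128 parameter budget.
    def __init__(self, rows=2048, dim=128, seed=0):
        self.rows = rows
        self.table = np.random.default_rng(seed).standard_normal((rows, dim)) * 0.02

    def lookup(self, ids):
        ids = np.asarray(ids, dtype=np.int64)
        prev = np.concatenate([np.zeros(1, dtype=np.int64), ids[:-1]])
        h = (prev * 1000003 + ids) % self.rows  # simple multiplicative hash (illustrative)
        return self.table[h]                    # (seq, dim)
```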
  - XSA: exclusive self-attention applied to the last 4 layers to reduce self-value bias
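The card does not define XSA beyond this line. One reading consistent with "exclusive self-attention ... to reduce self-value bias" is causal attention with the diagonal masked, so a position's output never mixes in its own value vector. A single-head sketch under that assumption:

```python
import numpy as np

def exclusive_causal_attention(q, k, v):
    # Assumed reading of XSA: position t attends only to positions < t,
    # never to itself. Position 0 has no valid target, so it falls back
    # to attending to itself.
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((seq, seq), dtype=bool), k=-1)  # strictly below diagonal
    mask[0, 0] = True                                      # fallback for position 0
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```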
  - MLP3x: wider MLP with 3x expansion
  - GQA: grouped-query attention with 8 query heads and 4 KV heads
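With 8 query heads sharing 4 KV heads, each K/V projection serves two query heads, halving the KV parameters and cache relative to standard multi-head attention. A sketch (non-causal, for brevity):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv=4):
    # q: (n_heads, seq, hd); k, v: (n_kv, seq, hd).
    # Query head h reads from KV head h // (n_heads // n_kv).
    group = n_heads // n_kv
    out = []
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]
        s = q[h] @ kh.T / np.sqrt(q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        out.append(w @ vh)
    return np.stack(out)
```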
  - GELU pre-enrichment: wider nonlinear pre-enrichment block before the transformer layers (dimensions: 512 -> 768 -> 512)
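Per the listed dimensions, the pre-enrichment block expands each embedding 512 -> 768, applies GELU, and projects back to 512 before the first transformer layer. A sketch; the residual connection is an assumption:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

class PreEnrich:
    # Nonlinear enrichment of embeddings before the transformer stack,
    # with the 512 -> 768 -> 512 shape from the card.
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((512, 768)) * 0.02
        self.w2 = rng.standard_normal((768, 512)) * 0.02

    def __call__(self, x):                       # x: (seq, 512)
        return x + gelu(x @ self.w1) @ self.w2   # residual (assumed)
```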
- Evaluation: sliding-window evaluation
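Sliding-window evaluation scores text longer than the model's context by re-reading overlapping windows and counting each token exactly once, so later tokens keep substantial left context. A sketch of the span bookkeeping; the window and stride values are assumptions:

```python
def sliding_windows(n_tokens, window=2048, stride=1024):
    # Returns (read_start, score_start, end) triples: the model reads
    # [read_start, end) but only positions [score_start, end) count
    # toward val_bpb, so scored tokens keep window - stride of context.
    spans, scored, start = [], 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, scored, end))
        scored = end
        start = max(0, end - (window - stride))
    return spans
```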
- Test-Time Training: score-first TTT
- Sequence Length: train_length 2048, eval_length null
- LR Schedule: warmdown (warmdown_steps: 3500)
- Regularization: weight decay (value: 0.04)
Novel Contributions
- Full-vocab 1024-token normalized n-gram scoring across all tokens
- Bayesian first-match blending with a neural prior
- Collision premium analysis showing inflated pseudo-probabilities from hash collisions
- Fixed 0.5 blend outperforming adaptive gating schemes
- Two-phase shared n-gram cache with global sequential cache construction
- GELU pre-enrichment block
- XSA on the last 4 layers