PR #916

open

10L + PPM Full-Rescore Order-12 N-gram (0.3461 BPB)

by Bortlesboat
val_bpb
0.3461
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.3-15.6 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Squared LeakyReLU activation in the MLP, i.e. LeakyReLU(0.5)^2 with negative slope 0.5.
parameters: {"slope":0.5}
Partial RoPE
Partial rotary positional embeddings applied to 16 of 64 head dimensions.
parameters: {"ratio":"16/64"}
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
Value Residual
Value residual connections are included in the architecture.
parameters: null
BigramHash
Bigram hash module with 4096 buckets.
parameters: {"dimensions":4096}
Quantization
mixed int5/int6
bits: null
scope: MLP/attention
Compression
zstd
level: 22
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"lr":0.03}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"pass_1":"store per-token model probabilities without n-gram blending","pass_2":"rescore with frozen cache"}
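The two-pass flow above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the linear blending rule, the `lam` weight, and the function names are assumptions, since the card only states that pass 1 stores raw model probabilities and pass 2 rescores against a frozen cache.

```python
import numpy as np

def score_first_ttt(model_probs, ngram_probs, lam=0.5):
    """Score-first TTT sketch.

    Pass 1 stored per-token model probabilities without n-gram blending
    (model_probs). Pass 2 rescores them against probabilities drawn from
    a frozen n-gram cache (ngram_probs). The linear blend and lam=0.5
    are illustrative assumptions.
    """
    model_probs = np.asarray(model_probs, dtype=float)
    ngram_probs = np.asarray(ngram_probs, dtype=float)
    blended = (1 - lam) * model_probs + lam * ngram_probs
    return -np.log2(blended).mean()  # mean bits per token of the blend
```

Keeping the cache frozen in pass 2 means the n-gram statistics are identical for every token, so rescoring is a pure post-hoc reweighting of the stored pass-1 scores.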
Other
other
PPM-style all-order blend across matching n-gram orders 2-12 using escape probabilities.
parameters: {"orders":"2-12"}
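A minimal sketch of an all-order PPM blend over orders 2-12: each matching order contributes probability mass scaled by the escape probabilities of all higher orders, rather than a hard backoff to the single longest match. The PPM-C-style escape estimate (`distinct / (total + distinct)`) and the uniform order-0 fallback are assumptions; the PR does not state its exact escape formula.

```python
import numpy as np

def ppm_blend(context_counts, vocab_size, orders=range(12, 1, -1)):
    """Blend n-gram predictions across all matching orders, PPM-style.

    context_counts: dict mapping order -> next-token count vector for the
                    current context at that order (absent if no match).
    Returns a probability vector over the vocabulary.
    """
    probs = np.zeros(vocab_size)
    escape_mass = 1.0  # mass escaped down from all higher orders so far
    for k in orders:   # highest order first, as in classic PPM
        counts = context_counts.get(k)
        if counts is None or counts.sum() == 0:
            continue  # no match at this order: escape with full mass
        total = counts.sum()
        distinct = np.count_nonzero(counts)
        escape = distinct / (total + distinct)  # PPM-C style escape (assumed)
        probs += escape_mass * (1 - escape) * counts / total
        escape_mass *= escape
    probs += escape_mass / vocab_size  # order-0 uniform fallback (assumed)
    return probs
```

Because every order's contribution is discounted by the escape mass of the orders above it, the result is a proper distribution without ever committing to a single backoff order.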
other
Leave-one-out self-exclusion during full-cache rescoring to subtract each token's own contribution before scoring.
parameters: null
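A minimal sketch of leave-one-out self-exclusion during full-cache rescoring, assuming a (context bucket, next token) count matrix built over all tokens. Since each token's own (context, target) pair is in the cache, its count is decremented before scoring it; the function name and count-matrix layout are hypothetical.

```python
import numpy as np

def rescore_leave_one_out(cache_counts, contexts, targets):
    """Score each token against the full cache minus its own contribution.

    cache_counts: (num_contexts, vocab) count matrix built over ALL tokens,
                  so each token's own (context, target) count is included.
    contexts, targets: per-token context bucket ids and next-token ids.
    Returns per-token probabilities with the self-count subtracted.
    """
    probs = np.empty(len(targets))
    for i, (c, t) in enumerate(zip(contexts, targets)):
        row = cache_counts[c].astype(float)
        row[t] -= 1.0  # leave-one-out: remove this token's own count
        total = row.sum()
        # Uniform fallback (assumed) when the token was the only occurrence.
        probs[i] = row[t] / total if total > 0 else 1.0 / row.size
    return probs
```

Without this subtraction, every token would partially predict itself, inflating the cache-based scores.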

Novel Contributions

  • PPM-style all-order blend across n-gram orders 2-12 instead of hard backoff
  • Leave-one-out self-exclusion during full-cache rescoring to remove self-inclusion bias
  • Two-pass score-first evaluation pipeline with frozen cache rescoring
  • Vectorized cache construction over all tokens using np.bincount
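The last bullet's vectorized cache construction can be sketched as below: fuse each (context bucket, next token) pair into a single flat index and count all pairs with one np.bincount call, avoiding a Python-level loop over tokens. The rolling polynomial hash and its multiplier are illustrative assumptions, not the PR's hash.

```python
import numpy as np

def build_ngram_cache(tokens, order, vocab_size, num_buckets):
    """Build an (num_buckets, vocab_size) n-gram count cache in one pass.

    Hashes every length-`order` context window to a bucket, fuses each
    (bucket, next_token) pair into a flat index, and counts all pairs at
    once with np.bincount.
    """
    tokens = np.asarray(tokens, dtype=np.int64)
    n_ctx = len(tokens) - order
    # Rolling polynomial hash over each context window (illustrative choice).
    ctx = np.zeros(n_ctx, dtype=np.int64)
    for j in range(order):
        ctx = (ctx * 1000003 + tokens[j : n_ctx + j]) % num_buckets
    nxt = tokens[order:]
    flat = ctx * vocab_size + nxt  # fuse (bucket, token) into one flat index
    counts = np.bincount(flat, minlength=num_buckets * vocab_size)
    return counts.reshape(num_buckets, vocab_size)
```

Only the short loop over the `order` hash terms remains in Python; the per-token work is entirely vectorized, which is what makes full-corpus cache construction cheap.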