PR #912

closed

10L + PPM Full-Rescore Order-12 N-gram (0.3461 BPB)

by Bortlesboat
val_bpb
0.3461
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.3-15.6 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
Partial RoPE
Partial rotary positional embeddings.
parameters: null
XSA
XSA applied in the last 4 layers.
parameters: {"layers":4}
Value Residual
Value residual connections in the model.
parameters: null
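As a minimal sketch of the grouped-query attention listed above (8 query heads, 4 KV heads), each KV head is shared by `heads // kv_heads = 2` query heads. Shapes and names here are illustrative, not the PR's implementation:

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads.  q: (T, n_heads, d), k/v: (T, n_kv_heads, d)."""
    T, _, d = q.shape
    group = n_heads // n_kv_heads        # 2 query heads per KV head
    out = np.empty_like(q)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    for h in range(n_heads):
        kv = h // group                  # index of the shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        scores[mask] = -np.inf
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8, 16))
k = rng.standard_normal((5, 4, 16))
v = rng.standard_normal((5, 4, 16))
o = gqa(q, k, v)
```

Halving the KV heads halves the KV-cache size while leaving the query-head count (and most of the model quality) intact.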
Regularization
LN scale
parameters: null
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
Compression
zstd
level: 22
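The quantization scope above (MLP at int5, attention at int6) can be sketched as symmetric per-tensor quantization followed by compression of the packed bytes. This is an assumption about the scheme, not the PR's code, and `zlib` stands in for zstd level 22 since zstd bindings are not in the standard library:

```python
import zlib   # stand-in for zstd; the artifact uses zstd level 22
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers
    (stored in int8 here; the real artifact would pack them tighter)."""
    qmax = 2 ** (bits - 1) - 1           # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w_mlp = rng.standard_normal(4096).astype(np.float32)
w_attn = rng.standard_normal(4096).astype(np.float32)
mlp_q, mlp_s = quantize_symmetric(w_mlp, bits=5)     # MLP weights: int5
attn_q, attn_s = quantize_symmetric(w_attn, bits=6)  # attention weights: int6
blob = zlib.compress(mlp_q.tobytes() + attn_q.tobytes(), level=9)
```

The restricted integer alphabet is what lets the entropy coder shrink the artifact well below the raw byte size.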
Weight Averaging
EMA
parameters: {"decay":0.997}
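The EMA weight averaging above (decay 0.997) amounts to one exponential-moving-average step per parameter after each optimizer update. A minimal sketch with plain dicts of floats:

```python
def ema_update(avg, new, decay=0.997):
    """One EMA step per parameter: avg <- decay * avg + (1 - decay) * new."""
    return {name: decay * avg[name] + (1 - decay) * new[name] for name in avg}

# Toy run: the averaged weight drifts toward a constant "trained" value,
# smoothing out step-to-step noise along the way.
avg = {"w": 0.0}
for _ in range(2000):
    avg = ema_update(avg, {"w": 1.0}, decay=0.997)
```

With decay 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.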
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"lr":0.03}
Evaluation
sliding window eval
parameters: null
stride-based eval
parameters: {"stride":64}
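The stride-based sliding-window evaluation above can be sketched as follows: the window advances by `stride` tokens, and only the final `stride` positions of each window (the ones with the longest left context) are kept. The window size and scorer here are assumptions for illustration:

```python
def sliding_window_nlls(score_window, tokens, window=512, stride=64):
    """Score every token exactly once with a window advancing by `stride`.

    `score_window(chunk)` is assumed to return one NLL per position in
    `chunk`.  The first window is kept whole; each later window keeps only
    its final `stride` positions."""
    nlls = list(score_window(tokens[:window]))
    pos = min(window, len(tokens))
    while pos < len(tokens):
        new = min(stride, len(tokens) - pos)      # last chunk may be short
        chunk = tokens[pos + new - window: pos + new]
        nlls.extend(score_window(chunk)[-new:])
        pos += new
    return nlls

# Dummy scorer that "scores" each token with its own value, so we can
# verify that every token is scored exactly once and in order.
tokens = list(range(1000))
nlls = sliding_window_nlls(lambda chunk: [float(t) for t in chunk], tokens)
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.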
Test-Time Training
LoRA TTT
parameters: {"rank":8}
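The LoRA test-time training entry (rank 8) keeps the base weights frozen and trains only a low-rank delta per document. A minimal sketch of the layer shape; the init scheme, `alpha`, and class name are illustrative assumptions:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a rank-r update; only `a` and `b` would be
    trained per document at test time (rank=8 in this PR)."""
    def __init__(self, w, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w.shape
        self.w = w                                        # frozen base weight
        self.a = rng.standard_normal((d_out, rank)) * 0.01
        self.b = np.zeros((rank, d_in))                   # zero init: no-op delta at start
        self.scale = alpha / rank

    def weight(self):
        return self.w + self.scale * (self.a @ self.b)

    def __call__(self, x):
        return x @ self.weight().T

rng = np.random.default_rng(1)
base = rng.standard_normal((32, 64))
layer = LoRALinear(base, rank=8)
x = rng.standard_normal((4, 64))
```

Because `b` starts at zero, the layer is exactly the base linear map until test-time updates move the adapter.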
Other
other
PPM-style all-order blend over matching n-gram orders 2-12 using escape probabilities, with leave-one-out self-exclusion during full-rescore.
parameters: {"orders":[2,12]}
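The PPM-style blend and leave-one-out self-exclusion described above can be sketched as follows. This is a simplified model, not the PR's code: it blends all orders from 0 up (the PR uses 2-12), uses PPM method-A escapes (escape mass 1/(total + 1) handed down to the next-lower order, bottoming out in a uniform distribution), and models self-exclusion by removing one count of the current token at every order:

```python
from collections import defaultdict

class PPM:
    """PPM-style all-order blend over n-gram counts with escape probabilities."""
    def __init__(self, max_order=12, vocab=256):
        self.max_order = max_order
        self.vocab = vocab
        # counts[k][ctx][sym] = occurrences of sym after the length-k context ctx
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def update(self, seq):
        for i, sym in enumerate(seq):
            for k in range(min(i, self.max_order) + 1):
                self.counts[k][tuple(seq[i - k:i])][sym] += 1

    def prob(self, ctx, sym, exclude_sym=None):
        """P(sym | ctx); exclude_sym removes one count of that symbol at
        every order (leave-one-out self-exclusion for full-rescore)."""
        p = 1.0 / self.vocab                       # order -1: uniform fallback
        for k in range(min(self.max_order, len(ctx)) + 1):
            c = self.counts[k][tuple(ctx[len(ctx) - k:])]
            n, total = c.get(sym, 0), sum(c.values())
            if exclude_sym is not None and c.get(exclude_sym, 0) > 0:
                total -= 1
                if sym == exclude_sym:
                    n -= 1
            p = (n + p) / (total + 1)              # counts + escape * lower order
        return p

ppm = PPM(max_order=4)
doc = list(b"abababab")
ppm.update(doc)
```

Without the exclusion, every token of the document being rescored would find its own occurrence in the cache, inflating its probability, which is the self-inclusion bias the contribution list mentions.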

Novel Contributions

  • PPM-style all-order blend across matching n-gram orders 2-12 using escape probabilities
  • Leave-one-out self-exclusion in full-rescore to remove self-inclusion bias
  • Two-pass evaluation pipeline with GPU sliding-window scoring, cache build, and full-token rescore
  • Mixed int5/int6 quantization with zstd compression
  • Neural cache and per-document LoRA test-time training described in the branch README