PR #889

open

Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed mean)

by anthony-maio
val_bpb: 0.9642
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.95 MB

Training Techniques

Architecture
LeakyReLU
Uses squared LeakyReLU activation in the MLP.
parameters: {"power":2,"slope":0.5}
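A minimal scalar sketch of the activation described above, assuming it means "apply LeakyReLU with slope 0.5, then raise the result to the power 2". The function name is hypothetical, and whether the record's implementation preserves the sign of negative inputs is not stated in the source; plain squaring is assumed here.

```python
def leaky_relu_squared(x: float, slope: float = 0.5, power: int = 2) -> float:
    """LeakyReLU followed by an elementwise power (assumed form of LeakyReLU^2)."""
    # Standard LeakyReLU: identity for x >= 0, scaled by `slope` otherwise.
    y = x if x >= 0.0 else slope * x
    # Assumption: plain power, so negative inputs map to positive outputs.
    return y ** power
```

With these defaults a negative input is halved before squaring, so it contributes a quarter of what the same magnitude contributes on the positive side.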
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
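To illustrate the head sharing implied by 8 query heads and 4 KV heads, the sketch below maps each query head to the KV head it would read from. It assumes the common contiguous grouping (adjacent query heads share a KV head); the source does not state the grouping order, and the function name is hypothetical.

```python
def kv_head_for_query_head(q_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Return the KV head index used by query head `q_head` (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head for 8/4
    return q_head // group_size

# Mapping for all 8 query heads.
mapping = [kv_head_for_query_head(h) for h in range(8)]
```

Under this assumption each KV head serves two query heads, which halves the key/value projection parameters and cache relative to full multi-head attention.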
VRL
Value Residual Learning module.
parameters: null
VE128
Value embedding dimension setting.
parameters: {"dimensions":128}
BigramHash
Bigram hash feature with 2048 buckets.
parameters: {"dimensions":2048}
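A rough sketch of how a bigram hash feature with 2048 buckets might index an embedding table: each (previous token, current token) pair is hashed to a bucket id. The multiplier and function names are hypothetical; the source only gives the bucket count, not the hash function.

```python
NUM_BUCKETS = 2048  # bucket count from the entry above

def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a token pair to a bucket id (illustrative hash, not the record's)."""
    # Assumption: a simple multiplicative mix; any well-spread hash would do.
    return (prev_token * 1_000_003 + token) % num_buckets

def bigram_bucket_ids(tokens):
    """Bucket id for every adjacent pair in `tokens` (one fewer than len(tokens))."""
    return [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

The bucket ids would then select rows of a small learned table, so distinct bigrams that collide share a row.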
XSA
XSA4 attention/sequence module.
parameters: {"variant":4}
Partial RoPE
Partial rotary positional embedding applied to a subset of dimensions.
parameters: {"train":16,"eval":64}
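The idea of rotating only a subset of dimensions can be sketched as follows. This assumes the adjacent-pair rotation convention and the usual base of 10000; `rot_dims` is a hypothetical parameter name, and the sketch does not attempt to interpret the "train" and "eval" values listed above, whose exact meaning the source does not spell out.

```python
import math

def partial_rope(vec, position, rot_dims, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec`; leave the remaining entries unchanged."""
    if rot_dims % 2 != 0 or rot_dims > len(vec):
        raise ValueError("rot_dims must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Each adjacent pair gets its own frequency, as in standard RoPE.
        theta = position * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Dimensions beyond `rot_dims` carry no positional signal, which leaves part of each head free to encode position-independent content.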
SmearGate
SmearGate gating mechanism.
parameters: null
U-Net skip connections
U-Net style skip connections in the network.
parameters: null
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997}
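The EMA half of this entry can be sketched as a per-step update with the listed decay of 0.997. The function name and the flat list-of-floats representation are assumptions for illustration; "Tight SWA" is not defined in the source and is not modelled here.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One exponential-moving-average step over a flat list of weights."""
    # Each averaged weight keeps `decay` of its old value and takes
    # (1 - decay) of the current training weight.
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With decay 0.997, the average has an effective horizon of roughly 1 / (1 - 0.997), about 333 steps.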
Quantization
GPTQ-lite
bits: 6
scope: model
Compression
lzma
level: null
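As a rough illustration of the quantize-then-compress pipeline, the sketch below uses plain symmetric round-to-nearest quantization to 6-bit integers, followed by `lzma` from the Python standard library. This is not GPTQ-lite itself, whose procedure the source does not describe; the function names, per-row scaling, and one-byte-per-value packing are all assumptions.

```python
import lzma

def quantize_row(row, bits=6):
    """Symmetric round-to-nearest quantization of one row to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]

def pack_and_compress(q):
    """Store each value in one byte (not bit-packed, for simplicity) and lzma-compress."""
    raw = bytes(v & 0xFF for v in q)
    return lzma.compress(raw)
```

Low-bit weights have a narrow value range, so an entropy-style compressor such as lzma can shrink the packed bytes further, which matters under an artifact size budget like the ~15.95 MB reported above.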
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Initialization
OrthoInit
Orthogonal initialization.
Regularization
LN scale
parameters: null
Evaluation
sliding window eval
parameters: null
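A sketch of one common form of sliding window evaluation: every window after the first re-reads earlier tokens as context but only scores the new `stride` tokens at its end, so each token is scored exactly once with as much left context as the window allows. The window length, stride, and function name are hypothetical; the source lists no parameters for this entry.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end, score_from) spans: the model sees [start, end)
    and only tokens in [score_from, end) contribute to the loss."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    score_from = 0
    while score_from < num_tokens:
        if score_from == 0:
            end = min(window, num_tokens)  # first window scores everything it sees
        else:
            end = min(score_from + stride, num_tokens)
        start = max(0, end - window)
        spans.append((start, end, score_from))
        score_from = end
    return spans
```

A smaller stride gives each scored token more context at the cost of more forward passes.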
Other
N-gram backoff cache
Entropy-adaptive n-gram backoff cache built causally from already-scored tokens, mixing neural and n-gram probabilities with score-first updates.
parameters: {"orders":"2-7gram","alpha_formula":"0.05 + 0.55 * sigmoid(2*(H-4))","min_count":2,"hash_buckets_per_order":4000000}
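The `alpha_formula` above can be sketched directly. The code assumes that `H` is the entropy of the neural model's next-token distribution, that it is measured in bits, and that alpha is the weight placed on the n-gram probability (so the cache is trusted more when the model is uncertain). None of these three points is stated explicitly in the source, and the function names are hypothetical.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits (unit is an assumption)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def ngram_weight(H):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4)), as listed in the parameters."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(p_neural, p_ngram, H):
    """Linear interpolation; alpha is assumed to weight the n-gram estimate."""
    alpha = ngram_weight(H)
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```

By construction alpha stays strictly between 0.05 and 0.60 and equals 0.325 at H = 4, so the neural model always keeps at least 40% of the weight.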

Novel Contributions

  • Entropy-adaptive multi-order n-gram backoff cache
  • Score-first causal n-gram table updates during evaluation
  • Linear interpolation of neural and n-gram probabilities based on model entropy
  • Record result: 0.9642 val_bpb, averaged over 3 seeds
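The score-first, causal behaviour listed above can be sketched as a cache that is queried before it is updated, so a token never influences its own score. This is an illustrative sketch rather than the PR's implementation: the class and function names are hypothetical, Python dictionaries keyed by a hashed bucket stand in for fixed-size tables, the backoff rule (use the highest order whose context count reaches `min_count`) is an assumption, and a fixed `alpha` replaces the entropy-adaptive weight.

```python
from collections import defaultdict

class NgramBackoffCache:
    def __init__(self, min_order=2, max_order=7, min_count=2, buckets=4_000_000):
        self.max_order = max_order
        self.orders = list(range(max_order, min_order - 1, -1))  # highest order first
        self.min_count = min_count
        self.buckets = buckets
        self.ctx_counts = {n: defaultdict(int) for n in self.orders}
        self.pair_counts = {n: defaultdict(int) for n in self.orders}

    def _key(self, items):
        # Hashing a tuple of ints is deterministic in CPython.
        return hash(items) % self.buckets

    def prob(self, history, token):
        """Estimate P(token | history) from the highest usable order, else None."""
        for n in self.orders:
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            seen = self.ctx_counts[n].get(self._key(ctx), 0)
            if seen >= self.min_count:
                return self.pair_counts[n].get(self._key(ctx + (token,)), 0) / seen
        return None  # no order has enough evidence: fall back to the neural model

    def update(self, history, token):
        """Add the (context, token) observation to every order's table."""
        for n in self.orders:
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            self.ctx_counts[n][self._key(ctx)] += 1
            self.pair_counts[n][self._key(ctx + (token,))] += 1

def score_stream(tokens, neural_prob, alpha=0.3):
    """Score each token first, then update the cache with it (score-first)."""
    cache = NgramBackoffCache()
    mixed = []
    for i, tok in enumerate(tokens):
        history = tuple(tokens[max(0, i - (cache.max_order - 1)):i])
        p_nn = neural_prob(history, tok)
        p_ng = cache.prob(history, tok)
        p = p_nn if p_ng is None else (1.0 - alpha) * p_nn + alpha * p_ng
        mixed.append(p)
        cache.update(history, tok)  # only now does the cache see `tok`
    return mixed
```

For example, `score_stream([1, 2, 3] * 3, lambda history, tok: 0.25)` returns the neural probability unchanged until a context has been observed at least `min_count` times, after which the repeated pattern raises the mixed probability.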