PR #870

open

Record: BROADSIDE — Full-Rescore N-gram Cache (val_bpb 0.0935)

by simon-marcus
val_bpb: 0.0935
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.97 MB

Training Techniques

Architecture
GQA
Grouped query attention in the transformer backbone.
parameters: {"layers":11,"dimensions":512,"kv_heads":4,"query_heads":8}
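To make the head layout concrete, here is a minimal NumPy sketch of grouped query attention with 8 query heads sharing 4 key/value heads. It assumes the 512 dimensions are split evenly across the 8 query heads (head width 64) and that attention is causal; the function name `gqa_attention` and the random inputs are illustrative and not taken from the PR.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Causal attention. q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_q, T, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv                      # query heads sharing each KV head
    k = np.repeat(k, group, axis=0)          # expand to (n_q_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)    # (n_q_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # hide future positions
    scores -= scores.max(axis=-1, keepdims=True)      # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 64))   # 8 query heads (assumed head width 64)
k = rng.standard_normal((4, 5, 64))   # 4 KV heads
v = rng.standard_normal((4, 5, 64))
out = gqa_attention(q, k, v)
```

In this sketch query heads 0 and 1 read KV head 0, heads 2 and 3 read KV head 1, and so on, which halves the key/value projection size relative to standard multi-head attention.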
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the MLP.
parameters: {"negative_slope":0.5}
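One plausible reading of "LeakyReLU(0.5)^2" is a leaky ReLU with negative slope 0.5 whose output is then squared elementwise. The sketch below follows that assumption; `leaky_relu_squared` is a hypothetical name, not the PR's.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by an elementwise square (assumed interpretation)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

out = leaky_relu_squared([-2.0, 0.0, 3.0])
```

Under this reading negative inputs still contribute a positive, scaled-down activation rather than being zeroed out.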
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"parallel":true,"adamw":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
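For the EMA entry, the usual update keeps a shadow copy of each parameter and moves it a small step toward the live value after every optimizer step. A minimal sketch with the listed decay of 0.997 follows; the dictionary layout and the name `ema_update` are assumptions for illustration, and SWA is not sketched because the card lists no parameters for it.

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    """Move each shadow parameter toward its live value by a factor (1 - decay)."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema

params = {"w": np.ones(3)}    # live weights after a step (toy values)
ema = {"w": np.zeros(3)}      # shadow weights
ema = ema_update(ema, params)
```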
Quantization
GPTQ-lite
bits: 6
scope: all
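The card does not describe what "GPTQ-lite" does internally, so the following is only a generic illustration of what 6-bit storage implies: plain symmetric round-to-nearest quantization with one scale per row. It is not GPTQ's error-compensating procedure, and every name in it is hypothetical.

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Round-to-nearest symmetric quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1                       # 31 for 6 bits
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floating-point weights."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16))
q, scale = quantize_symmetric(w)
w_hat = dequantize(q, scale)
```

With this scheme the reconstruction error of each weight is bounded by half of its row's scale.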
Compression
lzma
level: null
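As a rough illustration of the compression step, Python's standard `lzma` module can round-trip a buffer of quantized codes losslessly. The random int8 array below is a stand-in for real weights, and the default preset is used because the card lists the level as null.

```python
import lzma
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(-31, 32, size=4096, dtype=np.int8)  # stand-in 6-bit codes
raw = codes.tobytes()
packed = lzma.compress(raw)          # default preset; the PR's level is not stated
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)
```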
Evaluation
sliding window eval
parameters: {"stride":64}
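Sliding-window evaluation with a stride of 64 can be sketched as a schedule of overlapping windows in which each token is scored exactly once, using as much left context as the window allows. The context length of 128 in the example is an assumption chosen to keep the output small (the card does not state it), and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Return (start, end, score_from) triples; tokens in [score_from, end) are scored."""
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end        # everything up to `end` now has a score
        start += stride
    return windows

windows = sliding_windows(n_tokens=200, context_len=128, stride=64)
total_scored = sum(end - score_from for _, end, score_from in windows)
```

After the first window, every later window contributes only its final `stride` tokens (fewer at the end of the sequence), so each scored token sees at least `context_len - stride` tokens of context.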
Test-Time Training
score-first TTT
parameters: {"pass1_store_probs":true,"pass2_rescore_all_tokens":true}
Other
other
Two-pass n-gram rescoring: a full cache is built from all validation tokens using vectorized operations, then every token is rescored in pure NumPy.
parameters: {"ngram_orders":"2-12","cache_build":"np.bincount","rescore_scope":"all_tokens"}
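A loop-free cache build with `np.bincount` could look roughly like the sketch below, shown for a single n-gram order. It assumes contexts are hashed into a fixed number of buckets; the hash multiplier, the bucket count, and the name `build_ngram_counts` are illustrative choices, not details from the PR.

```python
import numpy as np

def build_ngram_counts(tokens, order, n_buckets=1 << 16):
    """Count hashed contexts and hashed (context, next token) pairs for one order."""
    tokens = np.asarray(tokens, dtype=np.int64)
    n_ctx = order - 1
    n_pos = len(tokens) - n_ctx               # positions with a full context
    ctx_hash = np.zeros(n_pos, dtype=np.int64)
    for k in range(n_ctx):                    # loops over context offsets, not tokens
        ctx_hash = (ctx_hash * 1000003 + tokens[k:k + n_pos]) % n_buckets
    target = tokens[n_ctx:]
    pair_hash = (ctx_hash * 1000003 + target) % n_buckets
    ctx_counts = np.bincount(ctx_hash, minlength=n_buckets)
    pair_counts = np.bincount(pair_hash, minlength=n_buckets)
    return ctx_hash, pair_hash, ctx_counts, pair_counts

tokens = [1, 2, 1, 2, 1, 3]
ctx_hash, pair_hash, ctx_counts, pair_counts = build_ngram_counts(tokens, order=2)
# Estimated probability of each observed next token given its hashed context.
probs = pair_counts[pair_hash] / ctx_counts[ctx_hash]
```

Repeating this for each order from 2 to 12 would give one pair of count tables per order. Hash collisions can inflate counts, so a real implementation would need to choose the bucket count with that in mind.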

Novel Contributions

  • Full-rescore two-pass n-gram cache that rescores all tokens instead of only a subset of chunks
  • Vectorized complete cache construction using np.bincount
  • Pure numpy pass-2 rescoring of every token with stored per-token probabilities and entropies
  • Entropy-adaptive alpha blending with per-order multipliers
  • Sliding-window pass 1 that stores model probabilities for later rescoring
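The pass-2 blend described in the list above can be sketched as follows. The card does not give the blending formula, so this assumes a simple rule: the mixing weight alpha grows linearly with the model's stored entropy, is scaled by a per-order multiplier, and is clipped to [0, 1]. The names `blend`, `alpha_base`, and `entropy_scale` are hypothetical.

```python
import numpy as np

def blend(p_model, p_ngram, entropy, order_mult, alpha_base=0.5, entropy_scale=4.0):
    """Mix stored model probabilities with n-gram estimates (assumed formula)."""
    alpha = np.clip(alpha_base * order_mult * entropy / entropy_scale, 0.0, 1.0)
    return (1.0 - alpha) * p_model + alpha * p_ngram

# Pass 1 is assumed to have stored these per-token arrays.
p_model = np.array([0.2, 0.2])     # model probability of the true token
entropy = np.array([0.0, 4.0])     # model entropy at each position
p_ngram = np.array([0.9, 0.9])     # cache estimate for the matched order
p_final = blend(p_model, p_ngram, entropy, order_mult=1.0)
bits = -np.log2(p_final)           # per-token cost after rescoring
```

With this rule a confident model (entropy near zero) keeps its own probability, while an uncertain one leans on the cache.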