PR #870
open
Record: BROADSIDE — Full-Rescore N-gram Cache (val_bpb 0.0935)
by simon-marcus
val_bpb
0.0935
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.97 MB
Training Techniques
Architecture
GQA
Grouped query attention in the transformer backbone.
parameters: {"layers":11,"dimensions":512,"kv_heads":4,"query_heads":8}
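A minimal sketch of how the head counts above relate, assuming the usual GQA layout in which consecutive query heads share one key/value head. The helper name and the derived head width are illustrative and are not taken from the PR.

```python
# Values copied from the parameters line above.
QUERY_HEADS = 8
KV_HEADS = 4
MODEL_DIM = 512


def kv_head_for_query(q_head, query_heads=QUERY_HEADS, kv_heads=KV_HEADS):
    """Hypothetical helper: index of the KV head that a query head reads from."""
    if query_heads % kv_heads != 0:
        raise ValueError("query_heads must be a multiple of kv_heads")
    if not 0 <= q_head < query_heads:
        raise IndexError("q_head out of range")
    group_size = query_heads // kv_heads  # query heads per shared KV head
    return q_head // group_size


# Assumption: the model width is split evenly across the query heads.
HEAD_DIM = MODEL_DIM // QUERY_HEADS

mapping = [kv_head_for_query(q) for q in range(QUERY_HEADS)]
```

With 8 query heads and 4 KV heads, each KV head serves two query heads, so the key/value projections are half the size they would be under full multi-head attention.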
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the MLP.
parameters: {"negative_slope":0.5}
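One plausible reading of LeakyReLU(0.5)^2 is "apply LeakyReLU with slope 0.5, then square the result". The scalar sketch below follows that reading; the function name is hypothetical and the PR's actual tensor code may differ.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Assumed form of the activation: LeakyReLU(x) followed by squaring."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Under this reading a negative input still contributes a (smaller) positive output, for example an input of -2.0 maps to (0.5 * -2.0)^2 = 1.0.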
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"parallel":true,"adamw":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
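For reference, a minimal sketch of an exponential moving average over weights with the decay listed above. The dict-of-floats representation and the function name are assumptions made for illustration; a real run would apply the same update to tensors.

```python
def ema_update(ema_weights, live_weights, decay=0.997):
    """Update the averaged copy in place: ema = decay * ema + (1 - decay) * live."""
    for name, value in live_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights
```

With decay 0.997 each step moves the averaged weights 0.3% of the way toward the live weights, which corresponds to an averaging horizon of roughly 1 / (1 - 0.997), about 333 steps.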
SWA
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: all
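The PR does not spell out what GPTQ-lite does, so the sketch below is not that method. It shows plain symmetric round-to-nearest quantization to 6 bits with one scale per row, only to illustrate what a 6-bit code range looks like. All names here are hypothetical.

```python
def quantize_row(row, bits=6):
    """Round-to-nearest stand-in (not GPTQ): map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits, symmetric range [-31, 31]
    max_abs = max((abs(v) for v in row), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Recover approximate floats from integer codes and the row scale."""
    return [c * scale for c in codes]
```

Each reconstructed value is within half a quantization step (scale / 2) of the original, which is the error a GPTQ-style method would then try to reduce further.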
Compression
lzma
level: null
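A minimal sketch of the compression step using Python's standard `lzma` module. Because the level is listed as null, the example leaves the preset at the library default; the function names and the byte payload are placeholders.

```python
import lzma


def pack(payload: bytes) -> bytes:
    """Compress a serialized artifact with the default LZMA preset."""
    return lzma.compress(payload)


def unpack(blob: bytes) -> bytes:
    """Inverse of pack()."""
    return lzma.decompress(blob)
```

The reported artifact size is presumably measured on the compressed output, so how well the quantized weights compress affects the final figure.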
Evaluation
sliding window eval
parameters: {"stride":64}
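A sketch of how sliding-window evaluation with stride 64 can be scheduled so that every token is scored exactly once while still seeing earlier context. The window length `seq_len` is not given in this summary and is a free parameter here; the function name is hypothetical.

```python
def sliding_windows(n_tokens, seq_len, stride=64):
    """Yield (begin, end, score_from).

    The model is fed tokens[begin:end]; only tokens[score_from:end] are scored,
    because earlier positions were already scored by a previous window.
    """
    if stride <= 0 or seq_len < stride:
        raise ValueError("need 0 < stride <= seq_len")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break
```

For example, with 200 tokens and an assumed window of 128, the second window covers tokens 64 to 191 but scores only tokens 128 to 191, each with at least 64 tokens of preceding context.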
Test-Time Training
score-first TTT
parameters: {"pass1_store_probs":true,"pass2_rescore_all_tokens":true}
Other
other
Two-pass n-gram rescoring: a full cache is built in one vectorized pass over all validation tokens, then every token is rescored using pure NumPy.
parameters: {"ngram_orders":"2-12","cache_build":"np.bincount","rescore_scope":"all_tokens"}
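The sketch below shows one way a full n-gram cache for a single order could be built with `np.bincount` and then read back for every position. The hashing scheme, bucket count, and function name are assumptions; the PR lists orders 2 through 12 but this summary does not describe how contexts are keyed.

```python
import numpy as np


def ngram_probs(tokens, order, n_buckets=1 << 16):
    """Per-token cache probability count(context, token) / count(context).

    Counts come from the whole sequence at once. Positions without a full
    context keep probability 0.
    """
    tokens = np.asarray(tokens, dtype=np.int64)
    n = tokens.size
    ctx_len = order - 1
    pos = np.arange(ctx_len, n)  # positions that have a full context

    # Hypothetical rolling hash of the preceding ctx_len tokens.
    ctx_hash = np.zeros(pos.size, dtype=np.int64)
    for j in range(1, ctx_len + 1):
        ctx_hash = (ctx_hash * 1000003 + tokens[pos - j]) % n_buckets
    full_hash = (ctx_hash * 1000003 + tokens[pos]) % n_buckets

    # One vectorized counting pass each for contexts and (context, token) pairs.
    ctx_counts = np.bincount(ctx_hash, minlength=n_buckets)
    full_counts = np.bincount(full_hash, minlength=n_buckets)

    probs = np.zeros(n, dtype=np.float64)
    ratio = full_counts[full_hash] / ctx_counts[ctx_hash]
    probs[pos] = np.minimum(ratio, 1.0)  # hash collisions can push the ratio above 1
    return probs
```

Because both count tables are filled before any lookup, every token is rescored against the same complete cache, which matches the "all_tokens" rescore scope listed above.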
Novel Contributions
- Full-rescore two-pass n-gram cache that rescores all tokens instead of only a subset of chunks
- Vectorized complete cache construction using np.bincount
- Pure numpy pass-2 rescoring of every token with stored per-token probabilities and entropies
- Entropy-adaptive alpha blending with per-order multipliers
- Sliding-window pass 1 that stores model probabilities for later rescoring
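The entropy-adaptive blending named in the list above is not specified further here, so the following is only a sketch of the general idea: lean more on the n-gram cache where the model's stored entropy is high, scaled by a per-order multiplier. The sigmoid gate, every constant, and the function name are assumptions rather than the PR's values.

```python
import numpy as np


def blend(p_model, p_ngram, entropy, order_mult=1.0,
          base_alpha=0.5, h_mid=3.0, h_scale=1.0):
    """Mix stored model probabilities with cache probabilities.

    alpha grows with the model's entropy (hypothetical sigmoid gate) and is
    scaled by a per-order multiplier, then clipped to [0, 1].
    """
    p_model = np.asarray(p_model, dtype=np.float64)
    p_ngram = np.asarray(p_ngram, dtype=np.float64)
    entropy = np.asarray(entropy, dtype=np.float64)
    gate = 1.0 / (1.0 + np.exp(-(entropy - h_mid) / h_scale))
    alpha = np.clip(base_alpha * order_mult * gate, 0.0, 1.0)
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

In a two-pass setup, `p_model` and `entropy` would be the per-token values stored during pass 1, so pass 2 needs only array arithmetic and no further model calls.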