PR #1369

open

Non-record: Negative results from gated multi-order hash n-grams

by xiayicheng3-code
val_bpb: 1.1196
Architecture: Transformer
Artifact Size: 15,937,956 bytes

Training Techniques

Architecture
  • BigramHash: multi-order hash n-gram prior replacing or extending the baseline bigram hash embedding with 2/3/4-gram hashed features and gating. parameters: {"orders":[2,3,4],"hashes":[1,1,1]}
  • BigramHash (earlier variant): multiple hash heads per n-gram order with hard top-1 routing / candidate-level gating. parameters: {"hashes":[2,2,2]}
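The multi-order lookup can be sketched as follows. This is a hypothetical illustration, not the PR's code: the table size, hash constant, and gate representation are assumptions; only the orders [2,3,4] and the single hash per order come from the listed parameters.

```python
# Hypothetical sketch of multi-order hash n-gram features with per-order
# gating. TABLE_SIZE and the multiplicative hash are assumptions.
from typing import Sequence

TABLE_SIZE = 4096   # hashed embedding rows per order (assumed)
ORDERS = (2, 3, 4)  # n-gram orders, matching {"orders":[2,3,4]}

def ngram_hash(tokens: Sequence[int], order: int, pos: int) -> int:
    """Stable hash of the `order` tokens ending at `pos` (inclusive)."""
    h = order  # salt with the order so different orders don't alias
    for t in tokens[pos - order + 1 : pos + 1]:
        h = (h * 1000003 + t) & 0xFFFFFFFF
    return h % TABLE_SIZE

def gated_ngram_rows(tokens, pos, gates):
    """Return (row_index, gate_weight) per usable order at position `pos`.

    `gates` stands in for the learned per-order gate; with a single hash
    per order ({"hashes":[1,1,1]}) there is no same-order competition.
    """
    rows = []
    for order, g in zip(ORDERS, gates):
        if pos + 1 >= order:  # need `order` tokens of left context
            rows.append((ngram_hash(tokens, order, pos), g))
    return rows
```

Because the hash is deterministic, the same n-gram always lands on the same row, which is what makes stable collisions behave like pseudo-identity features rather than fresh noise each step.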
Weight Averaging
  • EMA
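EMA weight averaging keeps a shadow copy of the parameters updated as a decayed running average of the training weights; a minimal sketch (the decay value is an assumption, the PR lists no parameters):

```python
# Minimal sketch of exponential moving average (EMA) weight averaging.
# The decay of 0.999 is an assumed typical value, not taken from the PR.
def ema_update(ema_params, params, decay=0.999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current."""
    for k, v in params.items():
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * v
    return ema_params
```

The EMA copy, not the raw training weights, is what gets evaluated and exported.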
Quantization
  • GPTQ: bits 6, scope all
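GPTQ quantizes weights column by column while compensating each rounding error with second-order (inverse-Hessian) information; that machinery is beyond a short sketch, but the 6-bit symmetric grid that bits 6 implies can be illustrated. This is round-to-nearest only, so it is explicitly not GPTQ itself:

```python
# Not GPTQ: a sketch of the symmetric 6-bit round-to-nearest grid that a
# bits=6 quantizer maps weights onto. GPTQ additionally reorders and
# error-compensates using Hessian information.
def quantize_6bit(weights):
    """Map floats onto a symmetric 6-bit grid (levels -31..31) and back."""
    qmax = 2 ** (6 - 1) - 1  # 31 levels on each side of zero
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], scale
```

With scope "all", every weight tensor of the 15.9 MB artifact is stored on such a grid.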
Evaluation
  • sliding window eval. parameters: {"stride":64}
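Sliding-window evaluation re-runs the model over overlapping windows so each token is scored with long left context while being counted exactly once. A sketch of the span bookkeeping; the window size of 256 is a hypothetical placeholder, only stride=64 comes from the listed parameters:

```python
# Sketch of sliding-window eval spans: each window scores only its tokens
# not covered by the previous window, so losses sum over every position once.
# window=256 is an assumed context length; stride=64 matches the parameters.
def sliding_window_spans(n_tokens, window=256, stride=64):
    """Return (ctx_start, end, n_scored) spans covering all n_tokens."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score the new tail
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.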
Regularization
  • magnitude pruning. parameters: {"selective":true}
  • repeat penalty. parameters: {"strength":"mild"}
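Magnitude pruning zeroes the weights with the smallest absolute values. A minimal sketch; reading {"selective":true} as applying the prune only to chosen tensors is an assumption, and the sparsity level is hypothetical:

```python
# Sketch of magnitude pruning: zero the `sparsity` fraction of entries with
# the smallest |w|. Which tensors to target ("selective") is an assumption.
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])  # indices of the k smallest magnitudes
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

The "faster selective-prune search" listed under contributions would then amount to searching sparsity levels (or tensor subsets) more cheaply, not changing this basic criterion.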
Other
  • Prompt-bank autoregressive source for GPTQ calibration.
  • Next-shard prefetch to reduce training-time and post-processing I/O stalls.
  • Asynchronous logging to reduce hidden runtime overhead.
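Next-shard prefetch overlaps loading shard i+1 with consuming shard i, so the trainer never idles on I/O between shards. A hypothetical sketch using a one-worker thread pool; `load_shard` stands in for whatever I/O the PR actually overlaps and is not its real code:

```python
# Hypothetical sketch of next-shard prefetch: while the caller consumes the
# current shard, a background worker loads the next one.
from concurrent.futures import ThreadPoolExecutor

def iter_shards_prefetched(paths, load_shard):
    """Yield loaded shards, always loading the next one in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_shard, paths[0]) if paths else None
        for i in range(len(paths)):
            shard = future.result()  # wait only if prefetch hasn't finished
            # kick off the next load before handing the current shard back
            future = pool.submit(load_shard, paths[i + 1]) if i + 1 < len(paths) else None
            yield shard
```

The asynchronous logging item follows the same pattern: hand log records to a background worker so the training loop never blocks on formatting or disk writes.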

Novel Contributions

  • Documented negative results showing that gated multi-order hash n-grams did not beat the BigramHash baseline in the 10 minute / 16 MB regime.
  • Showed that the best legal variant came from removing same-order hash competition with NGRAM_NUM_HASHES=1,1,1.
  • Found that tail-loss emphasis did not improve the roundtrip-to-sliding advantage.
  • Added practical experimentation improvements including prompt-bank AR calibration, printed AR calibration outputs, next-shard prefetch, async logging, and faster selective-prune search.
  • Argued that stable hash collisions may act as useful pseudo-identity features rather than pure noise.