PR #1369

open

Non-record: Negative results from gated multi-order hash n-grams

by xiayicheng3-code
val_bpb: 1.1196
Architecture: Transformer
Artifact Size: 15,937,956 bytes

Training Techniques

Architecture
  • BigramHash: multi-order hash n-gram prior replacing or extending the baseline bigram hash embedding with 2/3/4-gram hashed features and gating. parameters: {"orders":[2,3,4],"hashes":[1,1,1]}
  • BigramHash (earlier variant): multiple hash heads per n-gram order with hard top-1 routing / candidate-level gating. parameters: {"hashes":[2,2,2]}
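The multi-order lookup can be sketched as follows. This is a hypothetical illustration, not the PR's code: the table size, hash constant, and gate representation are assumptions; only the orders [2,3,4] and the single hash per order come from the listed parameters.

```python
# Hypothetical sketch of multi-order hash n-gram features with per-order
# gating. TABLE_SIZE and the multiplicative hash are assumptions.
from typing import Sequence

TABLE_SIZE = 4096   # hashed embedding rows per order (assumed)
ORDERS = (2, 3, 4)  # n-gram orders, matching {"orders":[2,3,4]}

def ngram_hash(tokens: Sequence[int], order: int, pos: int) -> int:
    """Stable hash of the `order` tokens ending at `pos` (inclusive)."""
    h = order  # salt with the order so different orders don't alias
    for t in tokens[pos - order + 1 : pos + 1]:
        h = (h * 1000003 + t) & 0xFFFFFFFF
    return h % TABLE_SIZE

def gated_ngram_rows(tokens, pos, gates):
    """Return (row_index, gate_weight) per usable order at position `pos`.

    `gates` stands in for the learned per-order gate; with a single hash
    per order ({"hashes":[1,1,1]}) there is no same-order competition.
    """
    rows = []
    for order, g in zip(ORDERS, gates):
        if pos + 1 >= order:  # need `order` tokens of left context
            rows.append((ngram_hash(tokens, order, pos), g))
    return rows
```

Because the hash is deterministic, the same n-gram always lands on the same row, which is what makes stable collisions behave like pseudo-identity features rather than fresh noise each step.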
Weight Averaging
  • EMA
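EMA weight averaging keeps a shadow copy of the parameters updated as a decayed running average of the training weights; a minimal sketch (the decay value is an assumption, the PR lists no parameters):

```python
# Minimal sketch of exponential moving average (EMA) weight averaging.
# The decay of 0.999 is an assumed typical value, not taken from the PR.
def ema_update(ema_params, params, decay=0.999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current."""
    for k, v in params.items():
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * v
    return ema_params
```

The EMA copy, not the raw training weights, is what gets evaluated and exported.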
Quantization
  • GPTQ: bits 6, scope all
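GPTQ quantizes weights column by column while compensating each rounding error with second-order (inverse-Hessian) information; that machinery is beyond a short sketch, but the 6-bit symmetric grid that bits 6 implies can be illustrated. This is round-to-nearest only, so it is explicitly not GPTQ itself:

```python
# Not GPTQ: a sketch of the symmetric 6-bit round-to-nearest grid that a
# bits=6 quantizer maps weights onto. GPTQ additionally reorders and
# error-compensates using Hessian information.
def quantize_6bit(weights):
    """Map floats onto a symmetric 6-bit grid (levels -31..31) and back."""
    qmax = 2 ** (6 - 1) - 1  # 31 levels on each side of zero
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q], scale
```

With scope "all", every weight tensor of the 15.9 MB artifact is stored on such a grid.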
Evaluation
  • sliding window eval. parameters: {"stride":64}
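Sliding-window evaluation re-runs the model over overlapping windows so each token is scored with long left context while being counted exactly once. A sketch of the span bookkeeping; the window size of 256 is a hypothetical placeholder, only stride=64 comes from the listed parameters:

```python
# Sketch of sliding-window eval spans: each window scores only its tokens
# not covered by the previous window, so losses sum over every position once.
# window=256 is an assumed context length; stride=64 matches the parameters.
def sliding_window_spans(n_tokens, window=256, stride=64):
    """Return (ctx_start, end, n_scored) spans covering all n_tokens."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score the new tail
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A smaller stride gives each scored token more context at the cost of proportionally more forward passes.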
Regularization
  • magnitude pruning. parameters: {"selective":true}
  • repeat penalty. parameters: {"strength":"mild"}
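Magnitude pruning zeroes the weights with the smallest absolute values. A minimal sketch; reading {"selective":true} as applying the prune only to chosen tensors is an assumption, and the sparsity level is hypothetical:

```python
# Sketch of magnitude pruning: zero the `sparsity` fraction of entries with
# the smallest |w|. Which tensors to target ("selective") is an assumption.
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])  # indices of the k smallest magnitudes
    return [0.0 if i in drop else w for i, w in enumerate(weights)]
```

The "faster selective-prune search" listed under contributions would then amount to searching sparsity levels (or tensor subsets) more cheaply, not changing this basic criterion.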
Other
  • Prompt-bank autoregressive source for GPTQ calibration.
  • Next-shard prefetch to reduce training-time and post-processing I/O stalls.
  • Asynchronous logging to reduce hidden runtime overhead.
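Next-shard prefetch overlaps loading shard i+1 with consuming shard i, so the trainer never idles on I/O between shards. A hypothetical sketch using a one-worker thread pool; `load_shard` stands in for whatever I/O the PR actually overlaps and is not its real code:

```python
# Hypothetical sketch of next-shard prefetch: while the caller consumes the
# current shard, a background worker loads the next one.
from concurrent.futures import ThreadPoolExecutor

def iter_shards_prefetched(paths, load_shard):
    """Yield loaded shards, always loading the next one in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_shard, paths[0]) if paths else None
        for i in range(len(paths)):
            shard = future.result()  # wait only if prefetch hasn't finished
            # kick off the next load before handing the current shard back
            future = pool.submit(load_shard, paths[i + 1]) if i + 1 < len(paths) else None
            yield shard
```

The asynchronous logging item follows the same pattern: hand log records to a background worker so the training loop never blocks on formatting or disk writes.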

Novel Contributions

  • Documented negative results showing that gated multi-order hash n-grams did not beat the BigramHash baseline in the 10 minute / 16 MB regime.
  • Showed that the best legal variant came from removing same-order hash competition with NGRAM_NUM_HASHES=1,1,1.
  • Found that tail-loss emphasis did not improve the roundtrip-to-sliding advantage.
  • Added practical experimentation improvements including prompt-bank AR calibration, printed AR calibration outputs, next-shard prefetch, async logging, and faster selective-prune search.
  • Argued that stable hash collisions may act as useful pseudo-identity features rather than pure noise.