PR #441 (open)
Add BigramHash: hashed bigram embeddings with optional dim projection
by CrimsonSithria
val_bpb: 1.2392
Architecture: Transformer
Optimizer: Adam
Artifact Size: —
Training Techniques
Architecture: BigramHash
Computes hashed bigram embeddings for (prev_token, cur_token) pairs and adds them to the token representations before the first transformer block.
parameters: {"BIGRAM_BUCKETS":12288,"BIGRAM_DIM":128}
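The mechanism above can be sketched in plain Python (this is an illustrative sketch, not the PR's actual code; the hash-mixing constant, function names, and the padding convention for position 0 are assumptions):

```python
# Sketch of hashed bigram embeddings with an optional projection.
# BIGRAM_BUCKETS and BIGRAM_DIM mirror the parameters above; everything
# else (hash constant, pad token 0) is hypothetical.

BIGRAM_BUCKETS = 12288
BIGRAM_DIM = 128

def bigram_bucket(prev_token: int, cur_token: int) -> int:
    """Hash a (prev_token, cur_token) pair into one of BIGRAM_BUCKETS buckets."""
    # Simple multiplicative mixing; a real implementation might use a
    # stronger hash to reduce bucket collisions.
    return (prev_token * 1000003 + cur_token) % BIGRAM_BUCKETS

def add_bigram_embeddings(token_ids, token_reps, bigram_table, proj=None):
    """Add the hashed bigram embedding for each position to its token rep.

    token_ids:    sequence of token ids.
    token_reps:   per-position vectors (lists of floats) of model dim.
    bigram_table: BIGRAM_BUCKETS x BIGRAM_DIM lookup table.
    proj:         optional BIGRAM_DIM x model_dim projection matrix, used
                  when BIGRAM_DIM differs from the model dim.
    """
    out = []
    for i, rep in enumerate(token_reps):
        prev = token_ids[i - 1] if i > 0 else 0  # assume id 0 pads position 0
        emb = bigram_table[bigram_bucket(prev, token_ids[i])]
        if proj is not None:
            # projected[j] = sum_k emb[k] * proj[k][j]
            emb = [sum(e * row[j] for e, row in zip(emb, proj))
                   for j in range(len(rep))]
        out.append([r + e for r, e in zip(rep, emb)])
    return out
```

Setting BIGRAM_BUCKETS to 0 would make the table empty, which is how the disable switch in the contributions list can be read.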
Optimizer: Adam
weight_decay: null
momentum: null
other_params: {"separate_optimizer_group":true,"bigram_lr_matches_token_embeddings":true}
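The `separate_optimizer_group` and `bigram_lr_matches_token_embeddings` settings above suggest the bigram table gets its own parameter group at the token-embedding learning rate. A hypothetical sketch of that grouping (parameter names and rates are illustrative, not from the PR):

```python
# Hypothetical optimizer grouping implied by other_params above.
# The learning-rate values and parameter lists are placeholders.

def build_param_groups(model_params, bigram_params, base_lr, embedding_lr):
    """Return per-group options with bigram params in their own group."""
    return [
        {"params": model_params, "lr": base_lr},
        # bigram_lr_matches_token_embeddings: reuse the embedding LR here
        {"params": bigram_params, "lr": embedding_lr},
    ]
```

In PyTorch, a list of dicts like this can be passed directly to `torch.optim.Adam`, which applies each group's `lr` to its own parameters.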
Novel Contributions
- Hashed bigram embeddings added to the model input representations
- Optional projection from bigram embedding dimension to model dimension to reduce artifact size
- Separate optimizer group for bigram parameters at token embedding learning rate
- Zero-overhead disable switch via BIGRAM_BUCKETS=0