PR #441

open

Add BigramHash: hashed bigram embeddings with optional dim projection

by CrimsonSithria
val_bpb
1.2392
Architecture
Transformer
Optimizer
Adam
Artifact Size

Training Techniques

Architecture
BigramHash
Adds hashed bigram embeddings for (prev_token, cur_token) pairs to the token representations before the first transformer block.
parameters: {"BIGRAM_BUCKETS":12288,"BIGRAM_DIM":128}
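A minimal sketch of what such a module could look like in PyTorch, assuming the listed defaults (BIGRAM_BUCKETS=12288, BIGRAM_DIM=128); the class name, hash constant, and first-position handling are illustrative, not taken from the PR:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embeddings added to token representations (sketch)."""

    def __init__(self, n_buckets=12288, bigram_dim=128, d_model=768):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)
        # Optional projection to d_model when bigram_dim is smaller,
        # which keeps the embedding table (and artifact) small.
        self.proj = (nn.Linear(bigram_dim, d_model, bias=False)
                     if bigram_dim != d_model else nn.Identity())

    def forward(self, tokens):
        # tokens: (B, T) int64 token ids.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at position 0 (illustrative choice)
        # Simple multiplicative hash of the (prev, cur) pair into buckets;
        # the actual hash used by the PR may differ.
        h = (prev * 1000003 + tokens) % self.n_buckets
        return self.proj(self.emb(h))
```

The output would be added elementwise to the token embeddings before the first block; setting the bucket count to 0 (the PR's disable switch) would presumably skip constructing the module entirely.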
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"separate_optimizer_group":true,"bigram_lr_matches_token_embeddings":true}

Novel Contributions

  • Hashed bigram embeddings added to the model input representations
  • Optional projection from bigram embedding dimension to model dimension to reduce artifact size
  • Separate optimizer group for bigram parameters at token embedding learning rate
  • Zero-overhead disable switch via BIGRAM_BUCKETS=0
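The separate optimizer group could be set up as below; this is a hedged sketch with illustrative module names and learning rates, showing only the grouping pattern (bigram parameters tied to the token-embedding learning rate, independent of the rest of the model):

```python
import torch
import torch.nn as nn

# Hypothetical parameter layout; names and sizes are illustrative.
tok_emb = nn.Embedding(50304, 768)
bigram_emb = nn.Embedding(12288, 128)
blocks = nn.Linear(768, 768)  # stand-in for the transformer blocks

token_emb_lr = 3e-4  # whatever LR the token embeddings already use
groups = [
    {"params": blocks.parameters(), "lr": 1e-4},
    {"params": tok_emb.parameters(), "lr": token_emb_lr},
    # Bigram parameters in their own group, matching the token-embedding LR.
    {"params": bigram_emb.parameters(), "lr": token_emb_lr},
]
opt = torch.optim.Adam(groups)
```

Keeping the bigram table in its own group also makes it easy to apply different weight decay or to drop the group entirely when BIGRAM_BUCKETS=0.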