PR #441

open

Add BigramHash: hashed bigram embeddings with optional dim projection

by CrimsonSithria
val_bpb
1.2392
Architecture
Transformer
Optimizer
Adam
Artifact Size

Training Techniques

Architecture
BigramHash
Adds hashed bigram embeddings for (prev_token, cur_token) pairs to the token representations before the first transformer block.
parameters: {"BIGRAM_BUCKETS":12288,"BIGRAM_DIM":128}
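A minimal sketch of what such a module could look like in PyTorch, assuming the listed defaults (BIGRAM_BUCKETS=12288, BIGRAM_DIM=128); the class name, hash constant, and first-position handling are illustrative, not taken from the PR:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embeddings added to token representations (sketch)."""

    def __init__(self, n_buckets=12288, bigram_dim=128, d_model=768):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)
        # Optional projection to d_model when bigram_dim is smaller,
        # which keeps the embedding table (and artifact) small.
        self.proj = (nn.Linear(bigram_dim, d_model, bias=False)
                     if bigram_dim != d_model else nn.Identity())

    def forward(self, tokens):
        # tokens: (B, T) int64 token ids.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at position 0 (illustrative choice)
        # Simple multiplicative hash of the (prev, cur) pair into buckets;
        # the actual hash used by the PR may differ.
        h = (prev * 1000003 + tokens) % self.n_buckets
        return self.proj(self.emb(h))
```

The output would be added elementwise to the token embeddings before the first block; setting the bucket count to 0 (the PR's disable switch) would presumably skip constructing the module entirely.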
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"separate_optimizer_group":true,"bigram_lr_matches_token_embeddings":true}

Novel Contributions

  • Hashed bigram embeddings added to the model input representations
  • Optional projection from bigram embedding dimension to model dimension to reduce artifact size
  • Separate optimizer group for bigram parameters at token embedding learning rate
  • Zero-overhead disable switch via BIGRAM_BUCKETS=0
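The separate optimizer group could be set up as below; this is a hedged sketch with illustrative module names and learning rates, showing only the grouping pattern (bigram parameters tied to the token-embedding learning rate, independent of the rest of the model):

```python
import torch
import torch.nn as nn

# Hypothetical parameter layout; names and sizes are illustrative.
tok_emb = nn.Embedding(50304, 768)
bigram_emb = nn.Embedding(12288, 128)
blocks = nn.Linear(768, 768)  # stand-in for the transformer blocks

token_emb_lr = 3e-4  # whatever LR the token embeddings already use
groups = [
    {"params": blocks.parameters(), "lr": 1e-4},
    {"params": tok_emb.parameters(), "lr": token_emb_lr},
    # Bigram parameters in their own group, matching the token-embedding LR.
    {"params": bigram_emb.parameters(), "lr": token_emb_lr},
]
opt = torch.optim.Adam(groups)
```

Keeping the bigram table in its own group also makes it easy to apply different weight decay or to drop the group entirely when BIGRAM_BUCKETS=0.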