PR #504

open

Non-record: TrigramHash — iso-parametric bigram(96)+trigram(32), val_bpb=1.5275 (1xH100)

by fleeb83
val_bpb: 1.5275
Architecture: Transformer
Optimizer: AdamW with Muon
Artifact Size: 15.4MB

Training Techniques

Architecture
TrigramHashEmbedding
Iso-parametric split of BigramHash (128 dim) into BigramHash (96 dim) plus a new TrigramHash (32 dim) that captures 3-token co-occurrence patterns via an orthogonal hash function
parameters: {"bigram_dim":96,"trigram_dim":32,"vocab_size":10240,"hash_function":"(36313*t[i] XOR 27191*t[i-1] XOR 18731*t[i-2]) % (vocab_size - 1)"}
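The listed hash function can be sketched in plain Python. The primes, vocab size, and modulus are taken from the parameters above; everything around the hash (the embedding-table lookup it feeds) is an assumption, not part of the PR text:

```python
# Sketch of the listed trigram hash; XOR in the spec maps to Python's ^
# (note * binds tighter than ^, matching the parenthesization in the spec).
VOCAB_SIZE = 10240  # "vocab_size" from the parameters above

def trigram_hash(t_minus2: int, t_minus1: int, t: int) -> int:
    """Hash three consecutive token ids into one trigram-table index.

    Three independent prime multipliers decorrelate the bit patterns
    contributed by each position before the XOR mix, so the same tokens
    in a different order generally land in a different bucket.
    """
    return (36313 * t ^ 27191 * t_minus1 ^ 18731 * t_minus2) % (VOCAB_SIZE - 1)
```

Because the modulus is `vocab_size - 1`, indices fall in `[0, 10238]` rather than filling the whole 10240-entry range.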
Quantization
mixed int6/int5
bits: null
scope: mlp, attn, bigram, trigram
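The PR does not spell out the quantization scheme (`bits: null` above), only that int6/int5 are mixed across the listed groups. As a hedged illustration only, a generic symmetric per-tensor quantizer parameterized by bit width, which a mixed scheme would apply group by group:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Generic symmetric per-tensor quantization sketch (scheme assumed).

    Maps floats to signed integers in [-2^(bits-1), 2^(bits-1) - 1]
    with a single scale per tensor; bits=6 or bits=5 would correspond
    to the int6/int5 mix named in the PR.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale  # dequantize as q * scale
```

Which groups get 6 bits versus 5, and whether scales are per-tensor or per-channel, is not stated in the PR.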
Optimizer
AdamW with Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
SWA
parameters: {"swa_steps":50}
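How `swa_steps: 50` interacts with the training schedule is not stated; a minimal running-average SWA update, assuming weight snapshots are folded in one at a time (e.g. every 50 steps), looks like:

```python
import numpy as np

def swa_update(w_avg: np.ndarray, w: np.ndarray, n: int) -> np.ndarray:
    """Fold checkpoint w into the running average as the (n+1)-th snapshot.

    After k snapshots, w_avg equals the plain mean of all k checkpoints,
    without ever storing more than one extra copy of the weights.
    """
    return w_avg + (w - w_avg) / (n + 1)
```

Evaluation then uses the averaged weights rather than the final-step weights.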
Compression
zstd
level: 22
Sequence Length
train_length: 2048
eval_length: null
Initialization
zero-init
Zero initialization of the trigram embedding and projection weights, so the trigram path starts as a no-op and is learned gradually
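A minimal sketch of the zero-init described above. Table and embedding dims come from the listed parameters; the projection wiring and the model dim of 768 are hypothetical:

```python
import numpy as np

# Shapes: vocab_size and trigram_dim from the listed parameters;
# MODEL_DIM = 768 is a hypothetical example value.
VOCAB_SIZE, TRIGRAM_DIM, MODEL_DIM = 10240, 32, 768

trigram_table = np.zeros((VOCAB_SIZE, TRIGRAM_DIM))  # zero-init embedding
trigram_proj = np.zeros((TRIGRAM_DIM, MODEL_DIM))    # zero-init projection

# A lookup-then-project at init contributes exactly zero to the model's
# output, so training starts from the bigram-only baseline ("no-op").
contribution = trigram_table[1234] @ trigram_proj
```

This is why the added component can be dropped in without an initial regression relative to the BigramHash(128 dim) baseline.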

Novel Contributions

  • Introduction of TrigramHashEmbedding to capture 3-token co-occurrence patterns orthogonally to the bigram embeddings
  • Iso-parametric embedding split that keeps the total embedding parameter count identical to the SOTA BigramHash(128 dim) baseline
  • Use of three independent prime multipliers in the hash function to decorrelate the bit patterns contributed by each token position
  • Extension of the quantization and optimizer parameter groups to cover the trigram embedding components
  • Demonstration that the architecture runs cleanly and the artifact stays within the 16MB limit despite the added trigram component