PR #440
open[10min/16MB] TrigramHash + EMA-SWA + Int4 QAT — val_bpb 1.2219
by Ashutosh3142857
val_bpb: 1.2219
Architecture: Transformer
Optimizer: —
Artifact Size: 15,892,490 bytes
Training Techniques
Architecture: TrigramHash
Adds a 3-token hashed context embedding table alongside BigramHash to capture richer token co-occurrence patterns.
parameters: {"vocab_size": 2048, "dim": 48}
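The TrigramHash idea above can be sketched as a hashed table lookup: the last three token ids are mixed into one of 2048 buckets, each holding a dim-48 vector. The function names and the hash mixing constant below are illustrative assumptions, not taken from the PR.

```python
import numpy as np

TABLE_SIZE, DIM = 2048, 48  # matches the PR's {"vocab_size": 2048, "dim": 48}
rng = np.random.default_rng(0)
trigram_table = rng.normal(0.0, 0.02, size=(TABLE_SIZE, DIM))

def trigram_bucket(t0: int, t1: int, t2: int) -> int:
    # Simple multiplicative hash of the 3-token window; any mixing scheme
    # works as long as it spreads collisions evenly across the table.
    h = (t0 * 1000003 + t1) * 1000003 + t2
    return h % TABLE_SIZE

def trigram_features(tokens: list[int]) -> np.ndarray:
    # One dim-48 feature per position, added to the model's input embedding.
    # The first two positions have no full trigram context and stay zero.
    out = np.zeros((len(tokens), DIM))
    for i in range(2, len(tokens)):
        out[i] = trigram_table[trigram_bucket(tokens[i-2], tokens[i-1], tokens[i])]
    return out

feats = trigram_features([5, 17, 99, 17, 99])  # shape (5, 48)
```

Identical 3-token windows always hash to the same bucket, so repeated trigrams share a learned vector; unrelated trigrams may collide, which the small table size trades off against parameter count.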
Architecture: depth increase
Adds an 11th transformer layer, funded by the parameter-budget savings from int4 compression.
parameters: {"layers": 11}
Weight Averaging: EMA-SWA
Keeps an exponential moving average over warmdown checkpoints, weighting later checkpoints more heavily.
parameters: {"alpha": 0.9}
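A minimal sketch of the EMA-SWA step: instead of a uniform average over warmdown checkpoints (plain SWA), maintain an exponential moving average so recent checkpoints dominate. The update convention (which side alpha multiplies) is an assumption; here alpha=0.9 decays the running average, giving each newer checkpoint relative weight 0.1 over the accumulated history.

```python
import numpy as np

def ema_swa(checkpoints, alpha=0.9):
    """Exponential moving average over a checkpoint sequence.

    checkpoints: list of {param_name: np.ndarray}, oldest first.
    Convention assumed here: avg <- alpha * avg + (1 - alpha) * new,
    so later warmdown checkpoints carry geometrically more weight
    than earlier ones.
    """
    avg = {k: v.copy() for k, v in checkpoints[0].items()}
    for ckpt in checkpoints[1:]:
        for k, v in ckpt.items():
            avg[k] = alpha * avg[k] + (1 - alpha) * v
    return avg
```

In practice the averaged weights replace the final checkpoint before quantization and evaluation.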
Quantization: STE QAT
Quantization-aware training of the MLP weights using straight-through-estimator (STE) fake quantization.
bits: 4, scope: MLP
Compression: zstd
level: 22
Novel Contributions
- TrigramHash(2048, dim=48) to extend bigram context features to 3-token windows
- EMA-SWA with alpha=0.9 to weight later warmdown checkpoints more heavily
- Int4 QAT on the MLP using STE fake quantization
- Using int4 savings to fund an 11th transformer layer