PR #440

Status: open

[10min/16MB] TrigramHash + EMA-SWA + Int4 QAT — val_bpb 1.2219

by Ashutosh3142857
val_bpb: 1.2219
Architecture: Transformer
Optimizer:
Artifact Size: 15,892,490 bytes

Training Techniques

Architecture: TrigramHash
Adds a 3-token hashed context embedding table alongside BigramHash to capture richer token co-occurrence patterns.
Parameters: vocab_size=2048, dim=48
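For reference, a minimal NumPy sketch of a hashed trigram embedding with the stated table size (2048 buckets, dim 48). The mixing constants, the zero-padding at the start of the sequence, and the init scale are illustrative assumptions, not taken from this PR.

```python
import numpy as np

class TrigramHash:
    """Hashed 3-token context embedding (sketch; hash constants are assumptions)."""

    def __init__(self, vocab_size=2048, dim=48, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab_size = vocab_size
        # Small-init embedding table, one row per hash bucket (0.02 scale assumed).
        self.table = (0.02 * rng.standard_normal((vocab_size, dim))).astype(np.float32)

    def bucket(self, t0, t1, t2):
        # Mix the three token ids with multiplicative hashing, then mod into the table.
        h = (t0 * 0x9E3779B1 ^ t1 * 0x85EBCA77 ^ t2 * 0xC2B2AE3D) & 0xFFFFFFFF
        return h % self.vocab_size

    def __call__(self, tokens):
        # Embed the trailing 3-token window at each position; the first two
        # positions see zero-padded context (padding scheme assumed).
        padded = [0, 0] + list(tokens)
        out = np.zeros((len(tokens), self.table.shape[1]), dtype=np.float32)
        for i in range(len(tokens)):
            out[i] = self.table[self.bucket(padded[i], padded[i + 1], padded[i + 2])]
        return out
```

Identical 3-token windows map to the same bucket, so repeated trigrams share an embedding row, analogous to the existing BigramHash but over a wider context.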
Architecture: depth increase
Uses an 11th transformer layer funded by int4 compression savings.
Parameters: layers=11
Weight Averaging: EMA-SWA
Exponential moving average over warmdown checkpoints, weighting later checkpoints more heavily.
Parameters: alpha=0.9
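A minimal sketch of the averaging rule, assuming checkpoints are dicts of NumPy arrays and the update is the standard EMA recurrence avg = alpha * avg + (1 - alpha) * new. A checkpoint k steps before the last contributes weight proportional to alpha**k, which is how later warmdown checkpoints come to dominate.

```python
import numpy as np

def ema_swa(checkpoints, alpha=0.9):
    """Exponential moving average over a sequence of checkpoints.

    Each checkpoint is a dict mapping parameter names to arrays (assumed
    format). Later checkpoints receive geometrically more weight.
    """
    # Start from the earliest checkpoint, then fold in the rest in order.
    avg = {name: w.astype(np.float64).copy() for name, w in checkpoints[0].items()}
    for ckpt in checkpoints[1:]:
        for name, w in ckpt.items():
            avg[name] = alpha * avg[name] + (1 - alpha) * w
    return avg
```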
Quantization: STE QAT
Parameters: bits=4, scope=MLP
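A sketch of the fake-quantization forward pass, assuming symmetric per-tensor int4 scaling (the PR does not state the scaling granularity). During QAT the straight-through estimator treats round() as the identity in the backward pass, so weights stay in float while the forward pass sees int4 quantization error.

```python
import numpy as np

def fake_quant_int4(w, eps=1e-8):
    """Symmetric per-tensor int4 fake quantization (assumed scheme).

    Maps weights onto the 16 signed levels [-8, 7] and dequantizes back.
    With the straight-through estimator, gradients flow through round()
    as if it were the identity.
    """
    # Map the largest magnitude to quantized level 7; eps guards all-zero tensors.
    scale = max(np.abs(w).max() / 7.0, eps)
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale
```

After training, the MLP weights can be stored as 4-bit integers plus a scale, which is the source of the artifact-size savings spent on the 11th layer.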
Compression: zstd
Parameters: level=22

Novel Contributions

  • TrigramHash(2048, dim=48) to extend bigram context features to 3-token windows
  • EMA-SWA with alpha=0.9 to weight later warmdown checkpoints more heavily
  • Int4 QAT on the MLP using STE fake quantization
  • Using int4 savings to fund an 11th transformer layer