PR #907

open

Record: Two-Pass Order-12 Shared N-gram Tables — val_bpb 0.0960 (3-seed mean)

by resouer
val_bpb
0.0960
Architecture
Transformer
Optimizer
Artifact Size
~15.6 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
Architecture
GQA
Grouped query attention in the transformer architecture.
parameters: {"layers":11,"dimensions":512,"heads":"8/4"}
MLP3x
Expanded MLP width to 3x the model dimension.
parameters: {"mlp_dim":1536}
LeakyReLU
Uses LeakyReLU(0.9) squared activation.
parameters: {"squared":true,"slope":0.9}
weight tying
Not mentioned explicitly in the submission.
parameters: null
Other
other
Shared n-gram tables across all 8 GPU ranks with deterministic updates and no all_reduce.
parameters: {"ranks":8,"shared_tables":true}
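The communication-free sharing can be sketched as follows: if every rank sees the same token stream in the same order and applies the same deterministic update rule, the per-rank tables stay bit-identical with no all_reduce. This is a minimal illustrative sketch; `NUM_RANKS`, `update_table`, and the table shape are assumptions, not the submission's actual code.

```python
import numpy as np

NUM_RANKS = 8  # matches the 8 GPU ranks in the record

def update_table(table: np.ndarray, context_hashes: np.ndarray) -> None:
    # Deterministic update: accumulate counts for each hashed context.
    # np.add.at handles repeated indices correctly (unlike table[idx] += 1).
    np.add.at(table, context_hashes, 1)

# Simulate all ranks applying the identical update to their local copy.
tables = [np.zeros(16, dtype=np.int64) for _ in range(NUM_RANKS)]
stream = np.array([3, 7, 3, 1])  # same stream on every rank
for t in tables:
    update_table(t, stream)

# Tables agree across ranks without any collective communication.
assert all(np.array_equal(tables[0], t) for t in tables)
```

Because the update is a pure function of the shared stream, determinism substitutes for synchronization.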
other
Two-pass rescoring: the first pass stores model probabilities and builds the full cache; the second pass rescores all tokens against the complete cache.
parameters: {"passes":2,"tokens":62000000}
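A toy version of the two-pass idea, with unigram counts standing in for the order-12 tables: pass 1 records model probabilities and fills the cache over the whole stream; pass 2 rescores every token against the completed cache, so early tokens are not penalized by a near-empty cache. The mixing weight `lam` and the bpb-style scoring are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def two_pass_rescore(tokens, model_probs, lam=0.3):
    """Sketch of two-pass rescoring (unigram cache as a stand-in)."""
    # Pass 1: build the *full* cache before scoring anything.
    counts = Counter(tokens)
    total = len(tokens)
    # Pass 2: mix model probability with complete-cache probability
    # for every token, eliminating the cold-start penalty.
    mixed = [
        (1 - lam) * p + lam * (counts[t] / total)
        for t, p in zip(tokens, model_probs)
    ]
    # bits-per-token-style score: mean negative log2 probability.
    return -float(np.mean(np.log2(mixed)))

score = two_pass_rescore([1, 1, 2], [0.5, 0.5, 0.25], lam=0.3)
```

A single-pass variant would score token 0 against an empty cache; here every position sees the same fully built cache.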
other
Order 2-12 backoff with entropy-adaptive alpha and per-order multipliers.
parameters: {"order_min":2,"order_max":12}
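One plausible reading of "entropy-adaptive alpha" is to scale the n-gram blend weight by the model's normalized predictive entropy: lean on the cache when the model is uncertain, back off to the model when it is confident. This is a hedged sketch; `base_alpha` and `order_mult` are illustrative placeholders, not the submission's values.

```python
import numpy as np

def entropy_adaptive_alpha(model_probs, base_alpha=0.5, order_mult=1.0):
    """Blend weight that grows with the model's predictive entropy.

    base_alpha  -- illustrative base mixing weight (assumption)
    order_mult  -- illustrative per-order multiplier (assumption)
    """
    p = np.asarray(model_probs, dtype=np.float64)
    # Normalized entropy in [0, 1]: 1 at uniform, 0 at a one-hot prediction.
    entropy = -np.sum(p * np.log2(np.clip(p, 1e-12, 1.0)))
    max_entropy = np.log2(len(p))
    return float(base_alpha * order_mult * (entropy / max_entropy))

alpha_uniform = entropy_adaptive_alpha([0.25, 0.25, 0.25, 0.25])  # 0.5
alpha_peaked = entropy_adaptive_alpha([1.0, 0.0, 0.0, 0.0])       # ~0.0
```

Per-order multipliers would then give each of the orders 2 through 12 its own scaled alpha before backoff combines them.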
other
Uses np.bincount for fast cache construction.
parameters: {"speedup_claimed":"10-50x"}
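The bincount trick replaces a per-token Python loop with one vectorized counting call over hashed contexts, which is where the claimed 10-50x speedup would come from. A minimal sketch (function name and table size are assumptions):

```python
import numpy as np

def build_counts(context_hashes: np.ndarray, table_size: int) -> np.ndarray:
    """Count occurrences of each hashed n-gram context in one call."""
    # minlength guarantees the output covers the whole table even if
    # the largest hash values never occur in this batch.
    return np.bincount(context_hashes, minlength=table_size)

hashes = np.array([3, 1, 3, 0, 1, 3])
counts = build_counts(hashes, table_size=5)
# counts -> [1, 2, 0, 3, 0]
```

The equivalent `for h in hashes: table[h] += 1` loop runs in interpreted Python per element; `np.bincount` does the same accumulation in a single C-level pass.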

Novel Contributions

  • Shared n-gram tables updated identically across all 8 GPU ranks without all_reduce
  • Two-pass rescoring that eliminates the cold-start problem by rescoring all tokens against a fully built cache
  • Order-2-to-12 backoff with entropy-adaptive alpha and per-order multipliers
  • np.bincount-based cache construction for faster table building
  • 3-seed validation with very low variance and a sub-0.0961 val_bpb mean