PR #907
openRecord: Two-Pass Order-12 Shared N-gram Tables — val_bpb 0.0960 (3-seed mean)
by resouer
val_bpb
0.0960
Architecture
Transformer
Optimizer
—
Artifact Size
~15.6 MB
Training Techniques
Quantization
int6
bits: 6
scope: all
Architecture
GQA
Grouped query attention in the transformer architecture.
parameters: {"layers":11,"dimensions":512,"heads":"8/4"}
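The 8/4 head configuration above means 8 query heads share 4 key/value heads, each KV head serving a group of 2 query heads. A minimal NumPy sketch of that grouping (illustrative weights and shapes, not the PR's implementation):

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    # Grouped query attention sketch: n_q_heads query heads share
    # n_kv_heads key/value heads (here the 8/4 split listed above).
    T, d = x.shape
    hd = d // n_q_heads                       # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)   # wk projects to d/2
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                       # shared KV head for this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal mask
        scores[mask] = -np.inf
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 16))
out = gqa_attention(x,
                    rng.standard_normal((16, 16)),
                    rng.standard_normal((16, 8)),
                    rng.standard_normal((16, 8)))
```

With 4 KV heads instead of 8, the KV projections and cache are half the size of full multi-head attention.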
MLP3x
Expanded MLP width to 3x the model dimension.
parameters: {"mlp_dim":1536}
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.9.
parameters: {"squared":true,"slope":0.9}
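One plausible reading of "LeakyReLU(0.9) squared" is to apply LeakyReLU with negative slope 0.9 and then square elementwise; the submission does not spell out the exact formulation, so this is an assumption:

```python
import numpy as np

def squared_leaky_relu(x, slope=0.9):
    # Hypothetical reading of the activation named above:
    # LeakyReLU with negative slope 0.9, then elementwise square.
    y = np.where(x > 0, x, slope * x)
    return y * y

out = squared_leaky_relu(np.array([2.0, -1.0]))
```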
weight tying
Not mentioned explicitly in the submission.
parameters: null
Other
other
Shared n-gram tables across all 8 GPU ranks with deterministic updates and no all_reduce.
parameters: {"ranks":8,"shared_tables":true}
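The key property claimed above is that every rank applies the same deterministic update to the same stream, so all 8 table replicas stay bit-identical without any collective communication. A small sketch of that invariant (illustrative counting logic, not the PR's code):

```python
import numpy as np

def update_tables(table, tokens, order=3, vocab=256):
    # Deterministic count update: every rank runs this identical
    # function over the same token stream, so replicas agree exactly
    # and no all_reduce is needed.
    for i in range(len(tokens) - order):
        ctx = tuple(tokens[i:i + order - 1])
        counts = table.setdefault(ctx, np.zeros(vocab, dtype=np.int64))
        counts[tokens[i + order - 1]] += 1
    return table

# Simulate 8 ranks applying the identical update.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 256, size=1000).tolist()
tables = [update_tables({}, tokens) for _ in range(8)]
```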
other
Two-pass rescoring: the first pass stores model probabilities and builds the full cache; the second pass rescores all tokens against the complete cache.
parameters: {"passes":2,"tokens":62000000}
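The point of the two-pass scheme above is that early tokens are never scored against a cold, half-built cache. A minimal sketch, assuming a simple additive-smoothed n-gram mixed with precomputed model log-probabilities (the mixing weight `alpha=0.3` is illustrative, not from the PR):

```python
import numpy as np

def two_pass_score(tokens, model_logp, order=3, alpha=0.3, vocab=4):
    # Pass 1: build the full n-gram cache over the entire stream.
    # (model_logp stands in for the stored pass-1 model probabilities.)
    table = {}
    for i in range(order - 1, len(tokens)):
        ctx = tuple(tokens[i - order + 1:i])
        counts = table.setdefault(ctx, np.zeros(vocab))
        counts[tokens[i]] += 1
    # Pass 2: rescore every token against the *complete* cache.
    total, n = 0.0, 0
    for i in range(order - 1, len(tokens)):
        counts = table[tuple(tokens[i - order + 1:i])]
        ngram_p = (counts[tokens[i]] + 1) / (counts.sum() + vocab)
        p = (1 - alpha) * np.exp(model_logp[i]) + alpha * ngram_p
        total += -np.log2(p)
        n += 1
    return total / n  # bits per token

# A repetitive stream the n-gram cache predicts well; the model is uniform.
tokens = [1, 2, 3] * 100
bpb = two_pass_score(tokens, np.full(300, np.log(0.25)))
```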
other
Order 2-12 backoff with entropy-adaptive alpha and per-order multipliers.
parameters: {"order_min":2,"order_max":12}
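A sketch of the backoff idea described above: try the longest matching context first, fall back toward order 2, and shrink the mixing weight when the matched count distribution is high-entropy. The alpha schedule and multipliers here are hypothetical stand-ins; the PR's exact formulas are not shown:

```python
import numpy as np

def backoff_prob(tables, context, token, order_max=12, order_min=2,
                 multipliers=None, vocab=256):
    # tables[order][ctx] -> count vector over the vocab.
    # Back off from the longest context until one with counts is found.
    multipliers = multipliers or {o: 1.0 for o in range(order_min, order_max + 1)}
    for order in range(order_max, order_min - 1, -1):
        ctx = tuple(context[-(order - 1):])
        counts = tables.get(order, {}).get(ctx)
        if counts is None or counts.sum() == 0:
            continue
        p = counts / counts.sum()
        # Entropy-adaptive alpha: low entropy -> trust the table more.
        ent = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        alpha = multipliers[order] * max(0.0, 1.0 - ent / np.log2(vocab))
        return alpha, (counts[token] + 1) / (counts.sum() + vocab)
    return 0.0, 1.0 / vocab  # no context matched: no n-gram evidence

c = np.zeros(256); c[3] = 100.0
tables = {3: {(1, 2): c}}
alpha, p = backoff_prob(tables, [0, 1, 2], 3, order_max=3)
alpha2, p2 = backoff_prob(tables, [9, 9, 9], 5, order_max=3)
```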
other
Uses np.bincount for fast cache construction.
parameters: {"speedup_claimed":"10-50x"}
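The np.bincount trick replaces a Python-level loop over tens of millions of tokens with one vectorized tally, which is where a speedup of the claimed magnitude would come from. A minimal order-2 sketch (the PR builds tables up to order 12; the encoding idea generalizes):

```python
import numpy as np

def build_bigram_cache(tokens, vocab=256):
    # Encode each (prev, next) pair as a single integer and let
    # np.bincount count all pairs in one vectorized call.
    t = np.asarray(tokens, dtype=np.int64)
    pair_ids = t[:-1] * vocab + t[1:]
    counts = np.bincount(pair_ids, minlength=vocab * vocab)
    return counts.reshape(vocab, vocab)  # counts[prev, next]

cache = build_bigram_cache([1, 2, 1, 2, 1], vocab=4)
```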
Novel Contributions
- Shared n-gram tables updated identically across all 8 GPU ranks without all_reduce
- Two-pass rescoring that eliminates the cold-start problem by rescoring all tokens against a fully built cache
- Order-2-to-12 backoff with entropy-adaptive alpha and per-order multipliers
- np.bincount-based cache construction for faster table building
- 3-seed validation with very low variance and a sub-0.0961 val_bpb mean