PR #907

open

Record: Two-Pass Order-12 Shared N-gram Tables — val_bpb 0.0960 (3-seed mean)

by resouer
val_bpb
0.0960
Architecture
Transformer
Optimizer
Artifact Size
~15.6 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
Architecture
GQA
Grouped query attention in the transformer architecture.
parameters: {"layers":11,"dimensions":512,"heads":"8/4"}
MLP3x
Expanded MLP width to 3x the model dimension.
parameters: {"mlp_dim":1536}
LeakyReLU
Uses LeakyReLU(0.9) squared activation.
parameters: {"squared":true,"slope":0.9}
weight tying
Not mentioned explicitly in the submission.
parameters: null
Other
other
Shared n-gram tables across all 8 GPU ranks with deterministic updates and no all_reduce.
parameters: {"ranks":8,"shared_tables":true}
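The communication-free sharing can be sketched as follows: if every rank sees the same token stream in the same order and applies the same deterministic update rule, the per-rank tables stay bit-identical with no all_reduce. This is a minimal illustrative sketch; `NUM_RANKS`, `update_table`, and the table shape are assumptions, not the submission's actual code.

```python
import numpy as np

NUM_RANKS = 8  # matches the 8 GPU ranks in the record

def update_table(table: np.ndarray, context_hashes: np.ndarray) -> None:
    # Deterministic update: accumulate counts for each hashed context.
    # np.add.at handles repeated indices correctly (unlike table[idx] += 1).
    np.add.at(table, context_hashes, 1)

# Simulate all ranks applying the identical update to their local copy.
tables = [np.zeros(16, dtype=np.int64) for _ in range(NUM_RANKS)]
stream = np.array([3, 7, 3, 1])  # same stream on every rank
for t in tables:
    update_table(t, stream)

# Tables agree across ranks without any collective communication.
assert all(np.array_equal(tables[0], t) for t in tables)
```

Because the update is a pure function of the shared stream, determinism substitutes for synchronization.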
other
Two-pass rescoring: the first pass stores model probabilities and builds the full cache; the second pass rescores all tokens against the complete cache.
parameters: {"passes":2,"tokens":62000000}
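A toy version of the two-pass idea, with unigram counts standing in for the order-12 tables: pass 1 records model probabilities and fills the cache over the whole stream; pass 2 rescores every token against the completed cache, so early tokens are not penalized by a near-empty cache. The mixing weight `lam` and the bpb-style scoring are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def two_pass_rescore(tokens, model_probs, lam=0.3):
    """Sketch of two-pass rescoring (unigram cache as a stand-in)."""
    # Pass 1: build the *full* cache before scoring anything.
    counts = Counter(tokens)
    total = len(tokens)
    # Pass 2: mix model probability with complete-cache probability
    # for every token, eliminating the cold-start penalty.
    mixed = [
        (1 - lam) * p + lam * (counts[t] / total)
        for t, p in zip(tokens, model_probs)
    ]
    # bits-per-token-style score: mean negative log2 probability.
    return -float(np.mean(np.log2(mixed)))

score = two_pass_rescore([1, 1, 2], [0.5, 0.5, 0.25], lam=0.3)
```

A single-pass variant would score token 0 against an empty cache; here every position sees the same fully built cache.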
other
Order 2-12 backoff with entropy-adaptive alpha and per-order multipliers.
parameters: {"order_min":2,"order_max":12}
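One plausible reading of "entropy-adaptive alpha" is to scale the n-gram blend weight by the model's normalized predictive entropy: lean on the cache when the model is uncertain, back off to the model when it is confident. This is a hedged sketch; `base_alpha` and `order_mult` are illustrative placeholders, not the submission's values.

```python
import numpy as np

def entropy_adaptive_alpha(model_probs, base_alpha=0.5, order_mult=1.0):
    """Blend weight that grows with the model's predictive entropy.

    base_alpha  -- illustrative base mixing weight (assumption)
    order_mult  -- illustrative per-order multiplier (assumption)
    """
    p = np.asarray(model_probs, dtype=np.float64)
    # Normalized entropy in [0, 1]: 1 at uniform, 0 at a one-hot prediction.
    entropy = -np.sum(p * np.log2(np.clip(p, 1e-12, 1.0)))
    max_entropy = np.log2(len(p))
    return float(base_alpha * order_mult * (entropy / max_entropy))

alpha_uniform = entropy_adaptive_alpha([0.25, 0.25, 0.25, 0.25])  # 0.5
alpha_peaked = entropy_adaptive_alpha([1.0, 0.0, 0.0, 0.0])       # ~0.0
```

Per-order multipliers would then give each of the orders 2 through 12 its own scaled alpha before backoff combines them.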
other
Uses np.bincount for fast cache construction.
parameters: {"speedup_claimed":"10-50x"}
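The bincount trick replaces a per-token Python loop with one vectorized counting call over hashed contexts, which is where the claimed 10-50x speedup would come from. A minimal sketch (function name and table size are assumptions):

```python
import numpy as np

def build_counts(context_hashes: np.ndarray, table_size: int) -> np.ndarray:
    """Count occurrences of each hashed n-gram context in one call."""
    # minlength guarantees the output covers the whole table even if
    # the largest hash values never occur in this batch.
    return np.bincount(context_hashes, minlength=table_size)

hashes = np.array([3, 1, 3, 0, 1, 3])
counts = build_counts(hashes, table_size=5)
# counts -> [1, 2, 0, 3, 0]
```

The equivalent `for h in hashes: table[h] += 1` loop runs in interpreted Python per element; `np.bincount` does the same accumulation in a single C-level pass.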

Novel Contributions

  • Shared n-gram tables updated identically across all 8 GPU ranks without all_reduce
  • Two-pass rescoring that eliminates the cold-start problem by rescoring all tokens against a fully built cache
  • Order-2-to-12 backoff with entropy-adaptive alpha and per-order multipliers
  • np.bincount-based cache construction for faster table building
  • 3-seed validation with very low variance and a sub-0.0961 val_bpb mean