PR #1114 (open)

Record: Packed N-gram + Dirichlet CTW — val_bpb 0.0235 (1xB200)

by minh-stakc
val_bpb: 0.0235
Architecture: Transformer
Optimizer: Muon
Artifact Size: 6.4 MB

Training Techniques

Architecture
GQA
Grouped query attention in the base Transformer.
parameters: {"layers":11,"dimensions":512,"kv_ratio":"8/4"}
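A minimal sketch of grouped-query attention, assuming the kv_ratio "8/4" means 8 query heads sharing 4 key/value heads (the exact head counts are an assumption; the card only gives the ratio):

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: each group of query heads
    attends with a shared KV head (8 query / 4 KV assumed)."""
    T, d = x.shape
    hd = d // n_q_heads                          # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)      # wk projects to n_kv_heads * hd
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads              # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                          # which KV head this query head shares
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)
```

The KV projections are half the size of the query projection, which is where GQA saves parameters and KV-cache memory.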
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"mlp_multiplier":3,"activation_power":2,"slope":0.5}
XSA
Uses XSA-4 attention/sequence component.
parameters: {"variant":4}
BigramHash
Bigram hash component used in the model.
parameters: null
Value Residual
Value residual mechanism (VRL).
parameters: null
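The card gives no parameters for the value-residual mechanism; a common form mixes each layer's value vectors with the first layer's. A minimal sketch, with a hypothetical mixing weight `lam` (normally learned per layer):

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Value-residual mix (VRL sketch): blend this layer's value
    vectors with the first layer's values. `lam` is a hypothetical
    mixing weight; the card lists no parameters."""
    return lam * v_layer + (1.0 - lam) * v_first
```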
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
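The card lists Muon with all hyperparameters null; for context, Muon's defining step is momentum accumulation followed by an approximate orthogonalization of the update via a quintic Newton-Schulz iteration. A sketch with assumed hyperparameters (the coefficients follow the public Muon reference implementation):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Push G toward the nearest semi-orthogonal matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference; step count is an assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                        # keep the Gram matrix small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update: momentum buffer, then orthogonalized step.
    lr and momentum here are assumptions (the card lists them as null)."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orth(buf)
    return w, buf
```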
Weight Averaging
EMA
parameters: {"decay":0.997}
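The EMA weight average with decay 0.997 from the card reduces to one update rule per tensor:

```python
def ema_update(ema, params, decay=0.997):
    """EMA weight averaging: ema <- decay * ema + (1 - decay) * params,
    applied elementwise per tensor (decay 0.997 from the card)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```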
Quantization
mixed int5/int6
bits: null
scope: MLP/attn
Regularization
magnitude pruning
parameters: {"sparsity":"3%"}
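A minimal sketch of magnitude pruning at the listed 3% sparsity: zero the smallest-magnitude 3% of weights (whether the submission prunes globally or per tensor is not stated; this version is global over one array):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero out the smallest-magnitude fraction of weights
    (sparsity "3%" from the card)."""
    flat = np.abs(w).ravel()
    k = int(round(sparsity * flat.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```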
Other
other
Packed n-gram hash tables (orders 2-13), precomputed from the training data and stored in the artifact.
parameters: {"orders":"2-13","buckets":32000}
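The packed tables above can be sketched as hashed count tables, one per order, with contexts hashed into a fixed number of buckets (orders 2-13 and 32000 buckets from the card; the hash function and layout are illustrative assumptions, and collisions merge counts, which is the price of the packed representation):

```python
from collections import defaultdict

def build_ngram_tables(tokens, orders=range(2, 14), buckets=32000):
    """Build hashed n-gram count tables: for each order n, map the
    hashed (n-1)-token context to counts of the next token."""
    tables = {n: defaultdict(lambda: defaultdict(int)) for n in orders}
    for n in orders:
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])   # (n-1)-token context
            bucket = hash(ctx) % buckets
            tables[n][bucket][tokens[i]] += 1  # count of the next token
    return tables
```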
other
Hierarchical Dirichlet CTW mixing across n-gram orders with per-order concentration parameters.
parameters: {"concentrations":[50,50,20,10,6,4,3,2.5]}
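The hierarchical Dirichlet mixing can be sketched as a backoff recursion: each order's predictive distribution smooths toward the next-shorter order with a per-order concentration alpha_n (the values from the card), bottoming out at uniform. The `tables[n]` layout (context tuple to next-token counts) is an assumption:

```python
def dirichlet_ctw_prob(token, context, tables, vocab_size,
                       concentrations=(50, 50, 20, 10, 6, 4, 3, 2.5)):
    """Hierarchical Dirichlet mixing across n-gram orders:

        p_n(t|ctx) = (c_n(ctx, t) + alpha_n * p_{n-1}(t|ctx[1:]))
                     / (C_n(ctx) + alpha_n)

    with p_1 uniform over the vocabulary. Large alpha_n leans on the
    shorter-order estimate; small alpha_n trusts the longer context."""
    p = 1.0 / vocab_size                      # order-1 base: uniform
    for order, alpha in zip(range(2, 2 + len(concentrations)), concentrations):
        ctx = tuple(context[-(order - 1):])   # last (order-1) tokens
        counts = tables.get(order, {}).get(ctx, {})
        total = sum(counts.values())
        p = (counts.get(token, 0) + alpha * p) / (total + alpha)
    return p
```

With empty tables every order passes the base probability through unchanged, so the prediction degrades gracefully to uniform.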
other
Online n-gram cache updated score-first: each window is scored with the current counts before its own n-grams are added.
parameters: {"orders":"2-9","buckets":4000000}
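A minimal sketch of the score-first protocol: each window is scored against the counts accumulated so far, and only then are its own n-grams added, so a token never benefits from its own statistics. Orders 2-9 and 4,000,000 buckets follow the card; the hashing and the hit-count scoring are illustrative assumptions:

```python
from collections import defaultdict

class OnlineNgramCache:
    """Online n-gram cache updated score-first after each window."""
    def __init__(self, orders=range(2, 10), buckets=4_000_000):
        self.orders = list(orders)
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, order, ctx):
        return (order, hash(tuple(ctx)) % self.buckets)

    def score(self, window):
        """Score step: look up counts seen so far for each position."""
        hits = 0
        for n in self.orders:
            for i in range(n - 1, len(window)):
                key = self._key(n, window[i - n + 1:i])
                hits += self.counts.get(key, {}).get(window[i], 0)
        return hits

    def update(self, window):
        """Update step: add the window's n-grams after it was scored."""
        for n in self.orders:
            for i in range(n - 1, len(window)):
                key = self._key(n, window[i - n + 1:i])
                self.counts[key][window[i]] += 1
```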

Novel Contributions

  • Packed training n-gram artifact precomputed during training and stored in the submission artifact
  • Hierarchical Dirichlet CTW mixing for combining n-gram orders
  • Combination of packed training statistics with an online score-first n-gram cache
  • Warm-started evaluation using compressed multi-order n-gram tables