PR #1114 (open)

Record: Packed N-gram + Dirichlet CTW — val_bpb 0.0235 (1xB200)

by minh-stakc
val_bpb: 0.0235
Architecture: Transformer
Optimizer: Muon
Artifact Size: 6.4 MB

Training Techniques

Architecture
GQA
Grouped query attention in the base Transformer.
parameters: {"layers":11,"dimensions":512,"kv_ratio":"8/4"}
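A minimal sketch of grouped-query attention, assuming the kv_ratio "8/4" means 8 query heads sharing 4 key/value heads (the exact head counts are an assumption; the card only gives the ratio):

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: each group of query heads
    attends with a shared KV head (8 query / 4 KV assumed)."""
    T, d = x.shape
    hd = d // n_q_heads                          # per-head dimension
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)      # wk projects to n_kv_heads * hd
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads              # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                          # which KV head this query head shares
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)
```

The KV projections are half the size of the query projection, which is where GQA saves parameters and KV-cache memory.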
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"mlp_multiplier":3,"activation_power":2,"slope":0.5}
XSA
Uses XSA-4 attention/sequence component.
parameters: {"variant":4}
BigramHash
Bigram hash component used in the model.
parameters: null
Value Residual
Value residual mechanism (VRL).
parameters: null
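The card gives no parameters for the value-residual mechanism; a common form mixes each layer's value vectors with the first layer's. A minimal sketch, with a hypothetical mixing weight `lam` (normally learned per layer):

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Value-residual mix (VRL sketch): blend this layer's value
    vectors with the first layer's values. `lam` is a hypothetical
    mixing weight; the card lists no parameters."""
    return lam * v_layer + (1.0 - lam) * v_first
```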
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
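The card lists Muon with all hyperparameters null; for context, Muon's defining step is momentum accumulation followed by an approximate orthogonalization of the update via a quintic Newton-Schulz iteration. A sketch with assumed hyperparameters (the coefficients follow the public Muon reference implementation):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Push G toward the nearest semi-orthogonal matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference; step count is an assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                        # keep the Gram matrix small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update: momentum buffer, then orthogonalized step.
    lr and momentum here are assumptions (the card lists them as null)."""
    buf = momentum * buf + grad
    w = w - lr * newton_schulz_orth(buf)
    return w, buf
```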
Weight Averaging
EMA
parameters: {"decay":0.997}
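The EMA weight average with decay 0.997 from the card reduces to one update rule per tensor:

```python
def ema_update(ema, params, decay=0.997):
    """EMA weight averaging: ema <- decay * ema + (1 - decay) * params,
    applied elementwise per tensor (decay 0.997 from the card)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```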
Quantization
mixed int5/int6
bits: null
scope: MLP/attn
Regularization
magnitude pruning
parameters: {"sparsity":"3%"}
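A minimal sketch of magnitude pruning at the listed 3% sparsity: zero the smallest-magnitude 3% of weights (whether the submission prunes globally or per tensor is not stated; this version is global over one array):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    """Zero out the smallest-magnitude fraction of weights
    (sparsity "3%" from the card)."""
    flat = np.abs(w).ravel()
    k = int(round(sparsity * flat.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```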
Other
other
Packed n-gram hash tables (orders 2-13), precomputed from the training data and stored in the artifact.
parameters: {"orders":"2-13","buckets":32000}
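The packed tables above can be sketched as hashed count tables, one per order, with contexts hashed into a fixed number of buckets (orders 2-13 and 32000 buckets from the card; the hash function and layout are illustrative assumptions, and collisions merge counts, which is the price of the packed representation):

```python
from collections import defaultdict

def build_ngram_tables(tokens, orders=range(2, 14), buckets=32000):
    """Build hashed n-gram count tables: for each order n, map the
    hashed (n-1)-token context to counts of the next token."""
    tables = {n: defaultdict(lambda: defaultdict(int)) for n in orders}
    for n in orders:
        for i in range(n - 1, len(tokens)):
            ctx = tuple(tokens[i - n + 1:i])   # (n-1)-token context
            bucket = hash(ctx) % buckets
            tables[n][bucket][tokens[i]] += 1  # count of the next token
    return tables
```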
other
Hierarchical Dirichlet CTW mixing across n-gram orders with per-order concentration parameters.
parameters: {"concentrations":[50,50,20,10,6,4,3,2.5]}
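The hierarchical Dirichlet mixing can be sketched as a backoff recursion: each order's predictive distribution smooths toward the next-shorter order with a per-order concentration alpha_n (the values from the card), bottoming out at uniform. The `tables[n]` layout (context tuple to next-token counts) is an assumption:

```python
def dirichlet_ctw_prob(token, context, tables, vocab_size,
                       concentrations=(50, 50, 20, 10, 6, 4, 3, 2.5)):
    """Hierarchical Dirichlet mixing across n-gram orders:

        p_n(t|ctx) = (c_n(ctx, t) + alpha_n * p_{n-1}(t|ctx[1:]))
                     / (C_n(ctx) + alpha_n)

    with p_1 uniform over the vocabulary. Large alpha_n leans on the
    shorter-order estimate; small alpha_n trusts the longer context."""
    p = 1.0 / vocab_size                      # order-1 base: uniform
    for order, alpha in zip(range(2, 2 + len(concentrations)), concentrations):
        ctx = tuple(context[-(order - 1):])   # last (order-1) tokens
        counts = tables.get(order, {}).get(ctx, {})
        total = sum(counts.values())
        p = (counts.get(token, 0) + alpha * p) / (total + alpha)
    return p
```

With empty tables every order passes the base probability through unchanged, so the prediction degrades gracefully to uniform.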
other
Online n-gram cache updated score-first: each window is scored with the current counts before its own n-grams are added.
parameters: {"orders":"2-9","buckets":4000000}
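A minimal sketch of the score-first protocol: each window is scored against the counts accumulated so far, and only then are its own n-grams added, so a token never benefits from its own statistics. Orders 2-9 and 4,000,000 buckets follow the card; the hashing and the hit-count scoring are illustrative assumptions:

```python
from collections import defaultdict

class OnlineNgramCache:
    """Online n-gram cache updated score-first after each window."""
    def __init__(self, orders=range(2, 10), buckets=4_000_000):
        self.orders = list(orders)
        self.buckets = buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, order, ctx):
        return (order, hash(tuple(ctx)) % self.buckets)

    def score(self, window):
        """Score step: look up counts seen so far for each position."""
        hits = 0
        for n in self.orders:
            for i in range(n - 1, len(window)):
                key = self._key(n, window[i - n + 1:i])
                hits += self.counts.get(key, {}).get(window[i], 0)
        return hits

    def update(self, window):
        """Update step: add the window's n-grams after it was scored."""
        for n in self.orders:
            for i in range(n - 1, len(window)):
                key = self._key(n, window[i - n + 1:i])
                self.counts[key][window[i]] += 1
```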

Novel Contributions

  • Packed training n-gram artifact precomputed during training and stored in the submission artifact
  • Hierarchical Dirichlet CTW mixing for combining n-gram orders
  • Combination of packed training statistics with an online score-first n-gram cache
  • Warm-started evaluation using compressed multi-order n-gram tables