PR #1030

open

Record: Single-Pass Packed N-gram + Dirichlet CTW — val_bpb 0.1130 (3-seed mean)

by sofiabod
val_bpb
0.1130
Architecture
Transformer
Optimizer
Muon
Artifact Size
5.76 MB

Training Techniques

Architecture
weight tying
Tied embeddings used in the base Transformer.
parameters: null
RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
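The `{"dimensions":16}` parameter suggests rotary embeddings are applied to only the first 16 dimensions of each head, with the remaining dimensions passed through unrotated. A minimal NumPy sketch of that partial-RoPE scheme (function name, pair layout, and base frequency are assumptions, not this PR's code):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` dimensions of a
    (seq_len, head_dim) tensor; remaining dimensions pass through.
    Illustrative sketch, not the PR's implementation."""
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    # one inverse frequency per rotated pair
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]  # assumed (x1, x2) pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```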
SWA
Stochastic weight averaging used during training.
parameters: null
BigramHash
Bigram hash component used in the model stack.
parameters: {"dimensions":128,"buckets":4096}
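A bigram-hash feature with these parameters can be sketched as hashing each (previous, current) token pair into one of 4096 buckets and gathering a 128-dim embedding row. The hash function and BOS handling below are assumptions for illustration:

```python
import numpy as np

def bigram_hash_features(tokens, table, buckets=4096):
    """Hash each (prev, cur) token pair into a bucket of a
    (buckets, 128) embedding table and gather its row.
    Mixing constant is illustrative, not from the PR."""
    feats = np.zeros((len(tokens), table.shape[1]))
    prev = 0  # assumed BOS id
    for i, cur in enumerate(tokens):
        h = (prev * 1000003 + cur) % buckets  # simple multiplicative hash
        feats[i] = table[h]
        prev = cur
    return feats
```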
SmearGate
SmearGate module used in the model stack.
parameters: null
VE128
Value residual/VE128 component used in later layers.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
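Muon's hyperparameters are unspecified here. In public reference implementations, Muon momentum-averages the gradient and then orthogonalizes the 2-D update with a quintic Newton-Schulz iteration. A hedged sketch along those lines (coefficients, step counts, and learning rate are taken from the public reference, not from this PR):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix via the quintic Newton-Schulz
    iteration used in public Muon implementations. Sketch only."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference-implementation constants
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-norm normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    """One Muon-style update: momentum-average the gradient, then apply
    the orthogonalized update. Hyperparameters are assumptions."""
    buf = momentum * buf + grad
    return param - lr * newton_schulz_orthogonalize(buf), buf
```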
Quantization
int6
bits: 6
scope: per-row
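Per-row int6 quantization gives each weight row its own scale so values round into the signed 6-bit range [-31, 31]. A sketch assuming symmetric round-to-nearest (the PR's exact scheme is unspecified):

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric int6 quantization: each row gets its own scale
    mapping values to integers in [-31, 31]. Illustrative sketch."""
    qmax = 31  # 6-bit signed symmetric range
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid divide-by-zero for all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales
```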
Compression
zstd
level: 22
Regularization
logit softcap
parameters: {"value":30}
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
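The EMA entry with decay 0.997 corresponds to the standard exponential-moving-average update over model parameters. A minimal sketch over a dict of parameters (data layout is an assumption):

```python
def ema_update(avg, new, decay=0.997):
    """One exponential-moving-average step over a dict of parameters:
    avg <- decay * avg + (1 - decay) * new."""
    return {k: decay * avg[k] + (1.0 - decay) * new[k] for k in avg}
```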
Evaluation
sliding window eval
parameters: {"stride":128,"seq_len":2048}
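With stride 128 and seq_len 2048, sliding-window evaluation typically scores every token exactly once while giving later tokens up to 1920 tokens of left context. A sketch of the window schedule, assuming the usual convention that the first window scores all its tokens and each later window scores only its final 128:

```python
def sliding_eval_spans(n_tokens, seq_len=2048, stride=128):
    """Window schedule for strided sliding evaluation: each window covers
    [start, end); only tokens in [score_from, end) are scored, so every
    token is scored exactly once. Assumed convention, not the PR's code."""
    spans = []
    start = 0
    while True:
        end = min(start + seq_len, n_tokens)
        score_from = 0 if start == 0 else start + seq_len - stride
        spans.append((start, end, score_from))
        if end >= n_tokens:
            break
        start += stride
    return spans
```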
Other
other
Single-pass score-first evaluation with packed multi-order n-gram cache and hierarchical Dirichlet CTW mixing.
parameters: {"orders":"2-13","buckets_per_order":131072,"concentrations":[50,50,20,10,6,4,3,2.5,2,1.8,1.6,1.4]}
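The concentrations list pairs one Dirichlet concentration with each n-gram order from 2 to 13. A standard way to mix such orders hierarchically is to let each order's predictive distribution back off to the next-lower order's through Dirichlet smoothing: p_k = (count_k[sym] + alpha_k * p_{k-1}) / (total_k + alpha_k). A toy sketch of that recursion (argument layout and data structures are assumptions, not this PR's cache format):

```python
def hierarchical_dirichlet_prob(sym, counts_by_order, totals_by_order,
                                concentrations, base_prob):
    """Back off through n-gram orders (lowest to highest): each order's
    Dirichlet-smoothed estimate interpolates its counts with the
    lower-order prediction. Sketch of the mixing rule only."""
    p = base_prob  # e.g. uniform or unigram probability of sym
    for counts, total, alpha in zip(counts_by_order, totals_by_order,
                                    concentrations):
        p = (counts.get(sym, 0) + alpha * p) / (total + alpha)
    return p
```

Since each order renormalizes against the same total, the mixture stays a proper distribution as long as the base distribution sums to one.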

Novel Contributions

  • Packed multi-order n-gram artifact precomputed from training shards to eliminate cold-start cache issues
  • Hierarchical Dirichlet CTW mixing across n-gram orders
  • Single-pass score-first evaluation with no two-pass rescore
  • Deterministic distributed cache prefill for warm-started evaluation
  • Ratio-preserving packed uint16 n-gram counts stored in a compressed artifact
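"Ratio-preserving packed uint16 counts" plausibly means raw counts are rescaled so the largest fits in 16 bits while relative frequencies are approximately kept. A hedged sketch of such a packing step (the PR's exact rounding and flooring rules are not stated):

```python
import numpy as np

def pack_counts_uint16(counts):
    """Downscale raw n-gram counts so the largest fits in uint16 while
    approximately preserving count ratios. Illustrative sketch."""
    counts = np.asarray(counts, dtype=np.float64)
    m = counts.max()
    if m <= 65535:
        return counts.astype(np.uint16)
    scaled = counts * (65535.0 / m)
    # keep nonzero counts nonzero so seen n-grams stay distinguishable
    packed = np.maximum(np.round(scaled), (counts > 0).astype(np.float64))
    return packed.astype(np.uint16)
```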