PR #870

open

Record: BROADSIDE — Full-Rescore N-gram Cache (val_bpb 0.0935)

by simon-marcus

val_bpb

0.0935

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.97 MB

Training Techniques

Architecture

GQA

Grouped query attention in the transformer backbone.

parameters: {"layers":11,"dimensions":512,"kv_heads":4,"query_heads":8}
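As a rough illustration of the grouped-query layout above (8 query heads sharing 4 key/value heads), the following NumPy sketch repeats each key/value head across its group of query heads. The head dimension of 64 (512 / 8), the causal mask, and the function name `grouped_query_attention` are assumptions for illustration; the PR summary does not show the actual implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Causal attention where several query heads share one K/V head.

    q: (n_query_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim)
    """
    n_q, seq_len, head_dim = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    # Repeat each K/V head so that consecutive query heads share it.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Causal mask (assumed): position t may not attend to positions > t.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 64))   # 8 query heads
k = rng.standard_normal((4, 16, 64))   # 4 key/value heads
v = rng.standard_normal((4, 16, 64))
out = grouped_query_attention(q, k, v)
```

With this grouping, query heads 0 and 1 read from key/value head 0, heads 2 and 3 from head 1, and so on, which halves the key/value projection and cache size relative to full multi-head attention.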

LeakyReLU

Uses LeakyReLU(0.5)^2 activation in the MLP.

parameters: {"negative_slope":0.5}
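One plausible reading of LeakyReLU(0.5)^2 is a leaky ReLU with negative slope 0.5 followed by an elementwise square. The sketch below assumes that reading; the function name is hypothetical, and the PR may define the activation differently (for example, with sign handling).

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """Leaky ReLU followed by an elementwise square (assumed interpretation)."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

activations = leaky_relu_squared(np.array([-2.0, 0.0, 3.0]))
```

Under this reading the output is never negative: a negative input x maps to (0.5 x)^2 = 0.25 x^2, so the negative branch still carries gradient, unlike a squared plain ReLU.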

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"parallel":true,"adamw":true}

Weight Averaging

EMA

parameters: {"decay":0.997}
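For reference, an exponential moving average with decay 0.997 is usually maintained as a shadow copy of the weights, updated after each optimizer step. This is a minimal sketch using plain floats in place of tensors; the dictionary layout and the name `ema_update` are illustrative assumptions, not taken from the PR.

```python
def ema_update(ema, params, decay=0.997):
    """Return the updated shadow weights: decay * old + (1 - decay) * current."""
    return {name: decay * ema[name] + (1.0 - decay) * params[name] for name in params}

ema = {"w": 1.0}
params = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, params)
```

With a decay of 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997), about 333 steps.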

SWA

parameters: null

Quantization

GPTQ-lite

bits: 6

scope: all

Compression

lzma

level: null
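The summary does not describe how GPTQ-lite chooses its quantized values, so the sketch below substitutes plain symmetric round-to-nearest quantization to 6 bits per weight, followed by `lzma` compression from the Python standard library. It only illustrates the general shape of a quantize-then-compress pipeline; the per-row scaling, the function names, and the int8 storage (rather than tightly packed 6-bit fields) are all assumptions.

```python
import lzma
import numpy as np

def quantize_6bit(w):
    """Symmetric per-row round-to-nearest quantization to integers in [-31, 31].

    This is NOT GPTQ; it is a simple stand-in used for illustration.
    """
    qmax = 2 ** (6 - 1) - 1  # 31
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
q, scale = quantize_6bit(w)

# Compress the integer codes; the scales would need to be stored alongside them.
blob = lzma.compress(q.tobytes())
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(q.shape)
```

The compression step is lossless, so all reconstruction error comes from the rounding, which is bounded by half a quantization step per weight.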

Evaluation

sliding window eval

parameters: {"stride":64}
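A sliding-window evaluation with stride 64 can be sketched as follows: each window advances by 64 tokens and only the newly exposed tokens are scored, so every token is scored exactly once with as much left context as the window allows. The context length is not given in the summary, so `context_len` is a hypothetical parameter, as is the function name.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Return (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions score_from..end-1 are scored.
    """
    windows = []
    scored_upto = 0
    while scored_upto < n_tokens:
        if scored_upto == 0:
            end = min(context_len, n_tokens)           # first window scores everything it sees
        else:
            end = min(scored_upto + stride, n_tokens)  # later windows score one stride
        start = max(0, end - context_len)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows

windows = sliding_windows(n_tokens=300, context_len=128, stride=64)
```

A smaller stride gives each scored token more context on average, at the cost of more forward passes.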

Test-Time Training

score-first TTT

parameters: {"pass1_store_probs":true,"pass2_rescore_all_tokens":true}

Other

other

Two-pass n-gram rescoring: a full cache is built in vectorized form from all validation tokens, and every token is then rescored in pure NumPy.

parameters: {"ngram_orders":"2-12","cache_build":"np.bincount","rescore_scope":"all_tokens"}
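The cache construction named above can be approximated with hashed n-gram counts. The sketch below handles a single order, hashes each context and each (context, next token) pair into a fixed number of buckets, and counts them with `np.bincount`. The hashing scheme, the bucket count, and the function names are assumptions; the summary only states that `np.bincount` is used and that orders 2 to 12 are covered.

```python
import numpy as np

HASH_MULT = np.uint64(1000003)  # arbitrary multiplier for the polynomial hash (assumed)

def build_ngram_counts(tokens, order, n_buckets=1 << 16):
    """Count hashed contexts and (context, next-token) pairs for one n-gram order."""
    tokens = np.asarray(tokens, dtype=np.uint64)
    ctx_len = order - 1
    n = len(tokens) - ctx_len                    # number of complete n-grams
    ctx_hash = np.zeros(n, dtype=np.uint64)
    for j in range(ctx_len):                     # polynomial hash over the context tokens
        ctx_hash = ctx_hash * HASH_MULT + tokens[j:j + n]
    full_hash = ctx_hash * HASH_MULT + tokens[ctx_len:]
    ctx_idx = (ctx_hash % np.uint64(n_buckets)).astype(np.int64)
    full_idx = (full_hash % np.uint64(n_buckets)).astype(np.int64)
    ctx_counts = np.bincount(ctx_idx, minlength=n_buckets)
    full_counts = np.bincount(full_idx, minlength=n_buckets)
    return ctx_idx, full_idx, ctx_counts, full_counts

def ngram_probs(tokens, order, n_buckets=1 << 16):
    """Per-position estimate count(context, token) / count(context)."""
    ctx_idx, full_idx, ctx_counts, full_counts = build_ngram_counts(tokens, order, n_buckets)
    return full_counts[full_idx] / ctx_counts[ctx_idx]

probs = ngram_probs([1, 2, 1, 2, 1, 3], order=2)
```

Because the counts are built from the whole sequence in one pass, there is no Python loop over tokens, only a short loop over the context length. Hash collisions can inflate counts in a real run; the toy example is small enough to avoid them.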

Full-rescore two-pass n-gram cache that rescores all tokens instead of only a subset of chunks
Vectorized complete cache construction using np.bincount
Pure numpy pass-2 rescoring of every token with stored per-token probabilities and entropies
Entropy-adaptive alpha blending with per-order multipliers
Sliding-window pass 1 that stores model probabilities for later rescoring
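The blending step listed above might look roughly like the sketch below, which mixes the stored model probability of each true token with an n-gram cache probability. The mixing weight alpha grows with the model's stored entropy and is scaled by a per-order multiplier. The functional form of alpha and the parameters `base_alpha`, `entropy_scale`, and `order_mult` are assumptions chosen for illustration; the summary does not give the actual formula.

```python
import numpy as np

def blend(p_model, p_ngram, entropy, order_mult=1.0, base_alpha=0.5, entropy_scale=4.0):
    """Mix model and n-gram probabilities, trusting the cache more when entropy is high."""
    alpha = np.clip(base_alpha * order_mult * entropy / entropy_scale, 0.0, 1.0)
    return (1.0 - alpha) * p_model + alpha * p_ngram

def mean_bits(p):
    """Average negative log2 probability per token."""
    return float(-np.log2(p).mean())

# Stored pass-1 values for two tokens (toy numbers).
p_model = np.array([0.9, 0.1])   # model probability of the true token
entropy = np.array([0.0, 4.0])   # model entropy at each position
p_ngram = np.array([0.5, 0.8])   # cache probability of the true token

p_mixed = blend(p_model, p_ngram, entropy)
```

In this toy case the confident first token keeps its model probability, while the uncertain second token is pulled toward the cache estimate, lowering the average bits per token.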