PR #933

open

Record: CacheMoney — 0.0804 BPB (3-seed mean, std 0.00003)

by haikosysView on GitHub

val_bpb

0.0804

Architecture

Transformer

Optimizer

Muon

Artifact Size

7.47 MB

Training Techniques

Architecture

BigramHash

Bigram hash cache component used in the model/cache system.

parameters: {"dimensions":128,"buckets":2048}

weight tying

Tied embeddings.

parameters: null

LeakyReLU

LeakyReLU squared activation.

parameters: {"negative_slope":0.5}

XSA

XSA used in the last 4 layers.

parameters: {"layers":4}

MLP3x

Three-times wider MLP block.

parameters: {"hidden_size":768}

Quantization

FP16

bits: 16

scope: all

other_params: null

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"lr":0.025}

AdamW

weight_decay: null

momentum: null

other_params: {"embeddings_lr":0.035,"scalars_lr":0.025}

Weight Averaging

EMA

parameters: {"decay":0.997}

SWA

parameters: null

Evaluation

sliding window eval

parameters: null

two-pass full-rescore

parameters: {"pass_1":"neural eval","pass_2":"sequential rescore"}

temperature sharpening

parameters: {"temperature":0.85}

Other

other

Online alpha calibration via grid search on the first 5% of scored tokens.

parameters: {"alpha_high":0.99,"entropy_thresh":3}

other

Leave-one-out scoring for cache probabilities to remove self-inclusion bias.

parameters: null

other

Sequential blend where n-gram cache is applied before phrase cache.

parameters: null

Novel Contributions

Cache-first submission where the cache dominates prediction quality and the neural model mainly provides blend probabilities.
Leave-one-out correction for two-pass cache scoring to remove self-inclusion bias.
Online alpha calibration on a small prefix of scored tokens to tune cache trust aggressively.
Two-pass full-rescore cache pipeline combining n-gram and phrase caches sequentially.
Temperature sharpening to improve model entropy and blending calibration.
Tiny FP16 model used primarily as a probability estimator rather than the main predictor.