PR #913 (open)

Record: Cache Is All You Need — val_bpb 0.0887, 622KB artifact (3-seed mean)

val_bpb:       0.0887
Architecture:  Transformer
Optimizer:     Muon
Artifact Size: 622 KB
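For reference, val_bpb is validation bits per byte: the summed negative log-likelihood over the validation text, converted from nats to bits and divided by the byte count. A minimal conversion helper (hypothetical name, assuming natural-log losses):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a
    byte-level validation set into bits per byte (bpb)."""
    return total_nll_nats / (total_bytes * math.log(2))
```

At 0.0887 bpb, the model plus cache spends under a tenth of a bit per byte of validation text.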

Training Techniques

Architecture
  • weight tying: input and output embeddings are tied (parameters: null)
  • GQA: grouped query attention with fewer KV heads than attention heads (parameters: {"heads": 4, "kv_heads": 2})

Regularization
  • logit softcap (parameters: {"value": 30})

Quantization
  • int8 (bits: 8, scope: baseline model)

Compression
  • zlib (level: null)

Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: null)

Evaluation
  • sliding window eval (parameters: null)

Other
  • Eval-time n-gram cache with order-adaptive entropy gating and adaptive blending with model probabilities (parameters: {"ngram_order": "2-12"})
  • Eval-time long phrase cache using hashed phrase probes at multiple lengths (parameters: {"phrase_lengths": [64, 56, 48, 36, 28, 20, 16]})

Test-Time Training
  • score-first TTT (parameters: {"online_cache_update": true})

Sequence Length
  • train_length: 1024, eval_length: 1024
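The n-gram cache entry above is terse, so here is a minimal sketch of the idea: count continuations of recent contexts at several orders, prefer the longest matching context, and blend the cache distribution with the model's only when the cache is confident (low entropy). The class name, the fixed blend weight, and the entropy threshold are illustrative assumptions, not the PR's actual gating or blending rule.

```python
import math
from collections import defaultdict

class NGramCache:
    """Eval-time n-gram cache with order-adaptive lookup and
    entropy-gated blending (hypothetical sketch, not the PR's code)."""

    def __init__(self, min_order=2, max_order=12,
                 entropy_threshold=1.0, blend=0.5):
        self.min_order = min_order
        self.max_order = max_order
        self.entropy_threshold = entropy_threshold  # nats; gate on cache confidence
        self.blend = blend                          # weight on cache when gated in
        # counts[context_tuple][next_token] -> occurrence count
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Record already-scored tokens (score-first: only tokens whose
        loss has been computed are added, so nothing leaks)."""
        for order in range(self.min_order, self.max_order + 1):
            for i in range(order, len(tokens)):
                ctx = tuple(tokens[i - order:i])
                self.counts[ctx][tokens[i]] += 1

    def _cache_dist(self, context):
        """Order-adaptive lookup: the longest matching context wins."""
        for order in range(self.max_order, self.min_order - 1, -1):
            if len(context) < order:
                continue
            ctx = tuple(context[-order:])
            if ctx in self.counts:
                c = self.counts[ctx]
                total = sum(c.values())
                return {t: n / total for t, n in c.items()}
        return None

    def blend_probs(self, context, model_probs):
        """Blend cache and model distributions if the cache is confident."""
        dist = self._cache_dist(context)
        if dist is None:
            return model_probs
        entropy = -sum(p * math.log(p) for p in dist.values())
        if entropy > self.entropy_threshold:
            return model_probs  # cache too uncertain: gate it out
        lam = self.blend
        vocab = set(model_probs) | set(dist)
        out = {t: (1 - lam) * model_probs.get(t, 0.0) + lam * dist.get(t, 0.0)
               for t in vocab}
        z = sum(out.values())
        return {t: p / z for t, p in out.items()}
```

On repetitive text the cache distribution collapses to near-zero entropy, so the gate opens and the blended probability of the repeated continuation rises sharply; on novel text no context matches and the model's probabilities pass through unchanged.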

Novel Contributions

  • Eval-time n-gram cache with adaptive entropy-based blending
  • Eval-time long phrase cache with multi-length phrase probes
  • Sliding window evaluation with online cache updates from already-scored tokens only
  • Minimal integration into the baseline with a small code change and one new cache file
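The third contribution, sliding-window evaluation with score-first cache updates, can be sketched as follows. Each window is fully scored before its tokens are pushed into the cache, so the cache never sees a token before that token's loss is computed. The function name and the `logprob_fn`/`cache_update_fn` interfaces are assumptions for illustration; the sketch returns bits per scored token (dividing further by bytes per token would give bpb).

```python
import math

def sliding_window_eval(logprob_fn, cache_update_fn, tokens,
                        window=1024, stride=512):
    """Score-first sliding-window evaluation (hypothetical interfaces).

    Every position is scored exactly once; cache updates happen only
    after a window is scored, so no future information leaks in."""
    total_nll = 0.0   # summed negative log-likelihood, in nats
    count = 0         # number of scored positions
    scored_upto = 0   # tokens[:scored_upto] have already been scored
    pos = 0
    while scored_upto < len(tokens):
        end = min(pos + window, len(tokens))
        # score only positions not covered by a previous window
        for i in range(max(scored_upto, pos + 1), end):
            total_nll -= logprob_fn(tokens[:i], tokens[i])
            count += 1
        scored_upto = end
        cache_update_fn(tokens[pos:end])  # update the cache only after scoring
        pos += stride
    # bits per scored token
    return total_nll / (count * math.log(2)) if count else 0.0
```

With a uniform model over a two-token vocabulary this yields exactly 1 bit per token, independent of window and stride, which is a quick sanity check that no position is scored twice or skipped.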