PR #913 (open)

Record: Cache Is All You Need — val_bpb 0.0887, 622KB artifact (3-seed mean)

val_bpb:       0.0887
Architecture:  Transformer
Optimizer:     Muon
Artifact Size: 622 KB
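For reference, val_bpb is validation bits per byte: the summed negative log-likelihood over the validation text, converted from nats to bits and divided by the byte count. A minimal conversion helper (hypothetical name, assuming natural-log losses):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a
    byte-level validation set into bits per byte (bpb)."""
    return total_nll_nats / (total_bytes * math.log(2))
```

At 0.0887 bpb, the model plus cache spends under a tenth of a bit per byte of validation text.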

Training Techniques

Architecture
  • weight tying: input and output embeddings are tied (parameters: null)
  • GQA: grouped query attention with fewer KV heads than attention heads (parameters: {"heads": 4, "kv_heads": 2})

Regularization
  • logit softcap (parameters: {"value": 30})

Quantization
  • int8 (bits: 8, scope: baseline model)

Compression
  • zlib (level: null)

Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: null)

Evaluation
  • sliding window eval (parameters: null)

Other
  • Eval-time n-gram cache with order-adaptive entropy gating and adaptive blending with model probabilities (parameters: {"ngram_order": "2-12"})
  • Eval-time long phrase cache using hashed phrase probes at multiple lengths (parameters: {"phrase_lengths": [64, 56, 48, 36, 28, 20, 16]})

Test-Time Training
  • score-first TTT (parameters: {"online_cache_update": true})

Sequence Length
  • train_length: 1024, eval_length: 1024
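The n-gram cache entry above is terse, so here is a minimal sketch of the idea: count continuations of recent contexts at several orders, prefer the longest matching context, and blend the cache distribution with the model's only when the cache is confident (low entropy). The class name, the fixed blend weight, and the entropy threshold are illustrative assumptions, not the PR's actual gating or blending rule.

```python
import math
from collections import defaultdict

class NGramCache:
    """Eval-time n-gram cache with order-adaptive lookup and
    entropy-gated blending (hypothetical sketch, not the PR's code)."""

    def __init__(self, min_order=2, max_order=12,
                 entropy_threshold=1.0, blend=0.5):
        self.min_order = min_order
        self.max_order = max_order
        self.entropy_threshold = entropy_threshold  # nats; gate on cache confidence
        self.blend = blend                          # weight on cache when gated in
        # counts[context_tuple][next_token] -> occurrence count
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Record already-scored tokens (score-first: only tokens whose
        loss has been computed are added, so nothing leaks)."""
        for order in range(self.min_order, self.max_order + 1):
            for i in range(order, len(tokens)):
                ctx = tuple(tokens[i - order:i])
                self.counts[ctx][tokens[i]] += 1

    def _cache_dist(self, context):
        """Order-adaptive lookup: the longest matching context wins."""
        for order in range(self.max_order, self.min_order - 1, -1):
            if len(context) < order:
                continue
            ctx = tuple(context[-order:])
            if ctx in self.counts:
                c = self.counts[ctx]
                total = sum(c.values())
                return {t: n / total for t, n in c.items()}
        return None

    def blend_probs(self, context, model_probs):
        """Blend cache and model distributions if the cache is confident."""
        dist = self._cache_dist(context)
        if dist is None:
            return model_probs
        entropy = -sum(p * math.log(p) for p in dist.values())
        if entropy > self.entropy_threshold:
            return model_probs  # cache too uncertain: gate it out
        lam = self.blend
        vocab = set(model_probs) | set(dist)
        out = {t: (1 - lam) * model_probs.get(t, 0.0) + lam * dist.get(t, 0.0)
               for t in vocab}
        z = sum(out.values())
        return {t: p / z for t, p in out.items()}
```

On repetitive text the cache distribution collapses to near-zero entropy, so the gate opens and the blended probability of the repeated continuation rises sharply; on novel text no context matches and the model's probabilities pass through unchanged.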

Novel Contributions

  • Eval-time n-gram cache with adaptive entropy-based blending
  • Eval-time long phrase cache with multi-length phrase probes
  • Sliding window evaluation with online cache updates from already-scored tokens only
  • Minimal integration into the baseline with a small code change and one new cache file
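The third contribution, sliding-window evaluation with score-first cache updates, can be sketched as follows. Each window is fully scored before its tokens are pushed into the cache, so the cache never sees a token before that token's loss is computed. The function name and the `logprob_fn`/`cache_update_fn` interfaces are assumptions for illustration; the sketch returns bits per scored token (dividing further by bytes per token would give bpb).

```python
import math

def sliding_window_eval(logprob_fn, cache_update_fn, tokens,
                        window=1024, stride=512):
    """Score-first sliding-window evaluation (hypothetical interfaces).

    Every position is scored exactly once; cache updates happen only
    after a window is scored, so no future information leaks in."""
    total_nll = 0.0   # summed negative log-likelihood, in nats
    count = 0         # number of scored positions
    scored_upto = 0   # tokens[:scored_upto] have already been scored
    pos = 0
    while scored_upto < len(tokens):
        end = min(pos + window, len(tokens))
        # score only positions not covered by a previous window
        for i in range(max(scored_upto, pos + 1), end):
            total_nll -= logprob_fn(tokens[:i], tokens[i])
            count += 1
        scored_upto = end
        cache_update_fn(tokens[pos:end])  # update the cache only after scoring
        pos += stride
    # bits per scored token
    return total_nll / (count * math.log(2)) if count else 0.0
```

With a uniform model over a two-token vocabulary this yields exactly 1 bit per token, independent of window and stride, which is a quick sanity check that no position is scored twice or skipped.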