PR #913
openRecord: Cache Is All You Need — val_bpb 0.0887, 622KB artifact (3-seed mean)
by RoyiRaView on GitHub
val_bpb: 0.0887
Architecture: Transformer
Optimizer: Muon
Artifact Size: 622 KB
Training Techniques
Architecture
weight tying
Input and output embeddings are tied.
parameters: null
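For reference, weight tying reuses a single matrix as both the input embedding table and the output projection. A minimal plain-Python sketch (the dimensions and init are illustrative, not from the submission):

```python
import random

vocab_size, d_model = 8, 4
random.seed(0)

# One shared matrix: rows are input embeddings AND per-token output weights.
W = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_id):
    # input embedding: row lookup
    return W[token_id]

def logits(hidden):
    # output head: dot product of the hidden state with every embedding row
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

h = embed(3)
out = logits(h)
# any update to W changes both roles at once, halving embedding parameters
```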
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":4,"kv_heads":2}
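With the record's parameters (4 attention heads, 2 KV heads), each KV head is shared by two query heads, halving the KV cache. A sketch of the head-to-KV-head mapping:

```python
# Grouped query attention: heads // kv_heads query heads read each KV head.
heads, kv_heads = 4, 2
group_size = heads // kv_heads  # 2 query heads per KV head

def kv_head_for(q_head):
    return q_head // group_size

mapping = [kv_head_for(h) for h in range(heads)]
# query heads 0,1 attend with KV head 0; query heads 2,3 with KV head 1,
# so K/V projections and cache are kv_heads / heads = 1/2 the MHA size
```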
Regularization
logit softcap
parameters: {"value":30}
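Logit softcapping with value 30 bounds the logits smoothly; the usual tanh form (assumed here, since the record only gives the cap value) is:

```python
import math

CAP = 30.0  # "value": 30 from the record

def softcap(logit, cap=CAP):
    # near-identity for small logits, smoothly saturating toward +/-cap
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged while extreme logits are squashed into (-30, 30), which stabilizes the loss without a hard clip.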
Quantization
int8
bits: 8
scope: baseline model
Compression
zlib
level: null
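The 622 KB artifact plausibly results from int8-quantizing the baseline weights and zlib-compressing the bytes. A sketch assuming symmetric per-tensor quantization (the record says only "int8" and leaves the zlib level null; level 9 below is illustrative):

```python
import math
import zlib

def quantize_int8(weights):
    # symmetric per-tensor scheme: an assumption, the record does not specify
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = bytes(round(w / scale) & 0xFF for w in weights)  # two's-complement bytes
    return scale, q

def dequantize_int8(scale, q):
    return [(b - 256 if b > 127 else b) * scale for b in q]

weights = [math.sin(0.01 * i) for i in range(4096)]
scale, q = quantize_int8(weights)
blob = zlib.compress(q, 9)                     # artifact payload
restored = dequantize_int8(scale, q)
err = max(abs(a - b) for a, b in zip(weights, restored))
# round-to-nearest keeps the error within half a quantization step
```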
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: null
Other
other
Eval-time n-gram cache with order-adaptive entropy gating and adaptive blending with model probabilities.
parameters: {"ngram_order":"2-12"}
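A minimal sketch of how such a cache could work. Only the 2-12 order range comes from the record; the highest-order back-off rule and the entropy gate shape are assumptions:

```python
import math
from collections import defaultdict

ORDERS = range(2, 13)  # "ngram_order": "2-12" from the record

class NgramCache:
    def __init__(self):
        # counts[n][context][next_token], built from already-scored tokens only
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

    def update(self, tokens):
        for n in ORDERS:
            if len(tokens) >= n:
                ctx, nxt = tuple(tokens[-n:-1]), tokens[-1]
                self.counts[n][ctx][nxt] += 1

    def predict(self, tokens):
        # order-adaptive: use the highest order whose context has been seen
        for n in reversed(ORDERS):
            if len(tokens) >= n - 1:
                hits = self.counts[n].get(tuple(tokens[-(n - 1):]))
                if hits:
                    total = sum(hits.values())
                    return {t: c / total for t, c in hits.items()}
        return None

def blend(model_probs, cache_probs):
    # entropy gate: a confident (low-entropy) cache gets more weight
    # (the 1 / (1 + H) gate is a hypothetical shape)
    if cache_probs is None:
        return model_probs
    entropy = -sum(p * math.log(p) for p in cache_probs.values() if p > 0)
    lam = 1.0 / (1.0 + entropy)
    return {t: (1 - lam) * p + lam * cache_probs.get(t, 0.0)
            for t, p in model_probs.items()}

cache = NgramCache()
stream = [1, 2, 3] * 4
for i in range(1, len(stream) + 1):
    cache.update(stream[:i])
```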
other
Eval-time long phrase cache using hashed phrase probes at multiple lengths.
parameters: {"phrase_lengths":[64,56,48,36,28,20,16]}
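A sketch of hashed phrase probes at the listed lengths. The probe lengths are from the record; the hash function and the longest-match-first policy are assumptions:

```python
PHRASE_LENGTHS = [64, 56, 48, 36, 28, 20, 16]  # from the record, probed longest first

def phrase_hash(tokens):
    # simple polynomial hash; the submission's hash is not specified
    h = 0
    for t in tokens:
        h = (h * 1000003 + t) & 0xFFFFFFFFFFFF
    return h

class PhraseCache:
    def __init__(self):
        self.table = {}  # (length, hash) -> observed next token

    def update(self, tokens):
        # index the phrase ending just before the newest (already-scored) token
        for L in PHRASE_LENGTHS:
            if len(tokens) > L:
                self.table[(L, phrase_hash(tokens[-L - 1:-1]))] = tokens[-1]

    def probe(self, tokens):
        # longest matching hashed phrase wins
        for L in PHRASE_LENGTHS:
            if len(tokens) >= L:
                hit = self.table.get((L, phrase_hash(tokens[-L:])))
                if hit is not None:
                    return hit
        return None

# a stream that repeats a 40-token sequence: the second pass hits the cache
cache = PhraseCache()
stream = list(range(40)) * 2
for i in range(1, len(stream) + 1):
    cache.update(stream[:i])
```

Storing only a fixed-width hash per probe length keeps the cache small regardless of phrase length.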
Test-Time Training
score-first TTT
parameters: {"online_cache_update":true}
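A sketch of the score-first loop: each token is scored under the blended distribution before the cache is updated with it, so the online update never leaks the token being scored. The fixed 0.5 blend and the cache interface are placeholders, not the submission's API:

```python
import math

def sliding_window_bits(tokens, model_prob, cache, window=1024, alpha=0.5):
    # score-first TTT: score, then update, so the cache only ever
    # contains already-scored tokens
    total_bits = 0.0
    for i in range(1, len(tokens)):
        ctx = tokens[max(0, i - window):i]        # 1024-token window, as in the record
        p = model_prob(ctx, tokens[i])
        p_cache = cache.prob(ctx, tokens[i])      # hypothetical cache interface
        if p_cache is not None:
            p = (1 - alpha) * p + alpha * p_cache # fixed blend, for illustration only
        total_bits -= math.log2(max(p, 1e-12))
        cache.update(ctx + [tokens[i]])           # update strictly after scoring
    return total_bits / (len(tokens) - 1)         # bits per token (bpb would divide by bytes)

class _NoCache:  # stub with no hits, just to exercise the loop
    def prob(self, ctx, tok): return None
    def update(self, tokens): pass

bits = sliding_window_bits([0, 1] * 8, lambda ctx, tok: 0.5, _NoCache())
# a uniform coin-flip model scores exactly 1 bit per token
```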
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Eval-time n-gram cache with adaptive entropy-based blending
- Eval-time long phrase cache with multi-length phrase probes
- Sliding window evaluation with online cache updates from already-scored tokens only
- Minimal integration into the baseline with a small code change and one new cache file