PR #318
openNeural Cache: Cross-Window KV Caching for Extended Eval Context (research proposal)
by sseanliu
val_bpb
1.1284
Architecture
Transformer
Optimizer
AdamW
Artifact Size
—
Training Techniques
Architecture
RoPE
Uses NTK-aware RoPE scaling for longer sequences and discusses extending effective context via cross-window KV caching.
parameters: {"train_seq_len":1024,"cache_tokens":8192,"effective_context":"50K+"}
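A minimal sketch of the NTK-aware RoPE base adjustment implied by these parameters (training length 1024, eval length 2048, so a scale of 2.0). The RoPE base of 10000 and head dimension of 64 are assumptions, not values from the card:

```python
def ntk_rope_base(base: float, scale: float, head_dim: int) -> float:
    # NTK-aware scaling: stretch the RoPE base so low-frequency dims
    # interpolate to longer contexts while high-frequency dims stay
    # close to their trained behavior.
    return base * scale ** (head_dim / (head_dim - 2))

def rope_freqs(head_dim: int, base: float) -> list[float]:
    # Inverse frequency for each rotary dimension pair.
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Scale = eval length / train length = 2048 / 1024 (values from the card);
# base=10000 and head_dim=64 are assumed for illustration.
scaled_base = ntk_rope_base(10000.0, 2048 / 1024, head_dim=64)
freqs = rope_freqs(64, scaled_base)
```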
XSA
Uses XSA from the base model recipe, enabled only in the last 4 layers.
parameters: {"layers":4}
SmearGate
Included as part of the base model recipe.
parameters: null
BigramHash
Included as part of the base model recipe.
parameters: {"bigram_vocab_size":2048}
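A hedged sketch of how a bigram-hash auxiliary embedding lookup could work with the listed vocab size of 2048. The hash function (and its multiplier) is hypothetical; only `bigram_vocab_size` comes from the card:

```python
def bigram_hash(prev_token: int, token: int, bigram_vocab_size: int = 2048) -> int:
    # Hash the (prev_token, token) bigram into a small auxiliary vocab.
    # The returned id would index an extra embedding table that is added
    # to the regular token embedding. The multiplier here is illustrative.
    return (prev_token * 1000003 + token) % bigram_vocab_size

bid = bigram_hash(5, 7)
```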
Weight Averaging
EMA
parameters: {"decay":0.997}
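The EMA update with the listed decay of 0.997 can be sketched as follows (a dict of scalar weights stands in for real model parameters):

```python
def ema_update(ema_params: dict, params: dict, decay: float = 0.997) -> dict:
    # Exponential moving average of model weights, applied after each
    # optimizer step; the averaged copy is used for evaluation.
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]
    return ema_params

ema = {"w": 1.0}
ema_update(ema, {"w": 0.0})  # ema["w"] is now 0.997
```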
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
cross-window KV caching
parameters: {"stride":64,"context_length":2048,"cache_tokens":8192}
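The window/cache bookkeeping for this evaluation setup can be sketched with the listed parameters (stride 64, context 2048, cache cap 8192). This only models which token positions are cached, not the attention computation itself:

```python
from collections import deque

def sliding_windows(n_tokens: int, context_length: int = 2048,
                    stride: int = 64, cache_tokens: int = 8192):
    # Yield (window_start, window_end, n_cached) for sliding-window eval.
    # n_cached is how many previously evaluated tokens the KV cache can
    # supply, so attention sees window + n_cached tokens of context.
    cache = deque(maxlen=cache_tokens)  # positions whose K/V are cached
    for start in range(0, max(n_tokens - context_length, 0) + 1, stride):
        end = min(start + context_length, n_tokens)
        yield start, end, len(cache)
        # After scoring, cache only the newest `stride` tokens (the ones
        # whose losses were just counted), avoiding redundant entries.
        cache.extend(range(end - stride, end))

windows = list(sliding_windows(2176))
```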
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
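A sketch of a warmup/warmdown LR multiplier using the listed 1500 warmup steps and 3000 warmdown iterations; the trapezoidal shape (linear warmup, constant plateau, linear warmdown) is an assumption about the schedule, and `total_iters` is a hypothetical run length:

```python
def lr_scale(step: int, total_iters: int,
             warmup_steps: int = 1500, warmdown_iters: int = 3000) -> float:
    # Multiplier applied to the base learning rate at a given step.
    if step < warmup_steps:                      # linear warmup
        return step / warmup_steps
    if step > total_iters - warmdown_iters:      # linear warmdown to zero
        return max((total_iters - step) / warmdown_iters, 0.0)
    return 1.0                                   # constant plateau
```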
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
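The Muon momentum warmup implied by these parameters (0.92 → 0.99 over 1500 steps) can be sketched as a linear ramp; linearity is an assumption:

```python
def muon_momentum(step: int, start: float = 0.92, final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    # Linearly warm the momentum coefficient from `start` to `final`
    # over the first `warmup_steps` steps, then hold it constant.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)
```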
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
Other
other
Eval-time technique that caches K/V pairs across sliding windows to extend effective context without training or changing model weights.
parameters: {"cache_tokens":8192,"stride":64}
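The per-layer cache mechanics described above can be sketched as follows; plain lists stand in for K/V tensors, and the class name is hypothetical. Only the cap (8192) and the store-newest-stride-tokens policy come from the proposal:

```python
class LayerKVCache:
    """Backward-looking-only K/V cache for one attention layer (eval only)."""

    def __init__(self, cache_tokens: int = 8192):
        self.cache_tokens = cache_tokens
        self.k: list = []  # one entry per cached token position
        self.v: list = []

    def append(self, k_new, v_new):
        # After a window is scored, store only the newest stride tokens'
        # K/V and evict the oldest entries beyond the cap.
        self.k = (self.k + list(k_new))[-self.cache_tokens:]
        self.v = (self.v + list(v_new))[-self.cache_tokens:]

    def extended_kv(self, k_win, v_win):
        # Prepend cached K/V so seqlen_k > seqlen_q; FlashAttention 3
        # supports that shape natively, so no custom kernel is needed.
        return self.k + list(k_win), self.v + list(v_win)

cache = LayerKVCache(cache_tokens=4)
cache.append([1, 2, 3], [10, 20, 30])
cache.append([4, 5, 6], [40, 50, 60])
k_ext, v_ext = cache.extended_kv([7], [70])
```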
Novel Contributions
- Cross-window KV caching at evaluation time to extend effective context beyond the sliding window.
- Backward-looking-only cache that reuses already-evaluated tokens without training on validation data.
- Compatibility with FlashAttention 3 for seqlen_k > seqlen_q without custom kernels.
- Per-layer cache that stores only the newest stride tokens to reduce redundancy.
- Proposal to mitigate long-context RoPE degradation via partial-layer caching and cache-size limits.