PR #1259
open
Non-record: KNN Hidden State Retrieval — Scale Deception from Weak to Strong Models (8xH100)
by himanshudongre
val_bpb
1.1533
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,826,144 bytes
Training Techniques
Architecture
BigramHash
Uses BigramHash as part of the merged leader model stack.
parameters: null
LeakyReLU
Uses a squared LeakyReLU activation variant in the model stack.
parameters: null
ReLU²
Uses a squared ReLU activation variant in the model stack.
parameters: null
XSA
Uses XSA-all as part of the merged leader architecture.
parameters: null
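The two squared activations listed above can be sketched in plain Python. This is only an illustration: it assumes "LeakyReLU squared" means applying LeakyReLU and then squaring the result, and the `negative_slope` default is a hypothetical value, since the summary lists no parameters.

```python
def relu_squared(x: float) -> float:
    """ReLU squared: 0 for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2


def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Apply LeakyReLU, then square the result.

    negative_slope is a hypothetical default, not a value taken from the PR.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Unlike ReLU², the leaky variant keeps a nonzero output for negative inputs, although squaring makes that output positive.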
Weight Averaging
EMA
parameters: null
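For reference, an exponential moving average of weights is typically maintained with an update like the one below. This is a generic sketch, not the PR's code; the `decay` value is hypothetical because the summary gives no parameters.

```python
def ema_update(ema, weights, decay=0.999):
    """Return the new EMA of the weights after one training step.

    ema and weights are equal-length lists of floats.
    decay is a hypothetical value, not taken from the PR.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

The averaged weights, rather than the raw final weights, would then be the ones evaluated or exported.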
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Quantization
GPTQ
bits: 6
scope: all
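To show what a 6-bit grid means in practice, here is a plain symmetric round-to-nearest quantizer. It is deliberately not GPTQ, which additionally compensates rounding error using second-order statistics; the per-list scale and the function names are assumptions made for illustration.

```python
def quantize_rtn(values, bits=6):
    """Symmetric round-to-nearest quantization to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so that rounding can never leave the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]
```

With this scheme each weight costs 6 bits plus a shared scale, and the reconstruction error per value is at most half a quantization step.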
Compression
lzma
level: 9
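Assuming "level: 9" refers to the LZMA preset, the compression step can be reproduced with Python's standard `lzma` module. The helper name and the idea of compressing one serialized byte payload are illustrative assumptions.

```python
import lzma


def compress_artifact(payload: bytes) -> bytes:
    """Compress a serialized artifact with LZMA at the highest standard preset."""
    return lzma.compress(payload, preset=9)


def decompress_artifact(blob: bytes) -> bytes:
    """Inverse of compress_artifact."""
    return lzma.decompress(blob)
```

The length of the compressed blob is what would count toward a byte-denominated artifact size.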
Other
other
Eval-time KNN retrieval over hidden states: hidden states are stored in a datastore, and the token distribution of the nearest neighbors is mixed with the model's own prediction.
parameters: {"k":8,"lambda":0.12,"subsample":4}
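The mixing step can be sketched as a linear interpolation between the two distributions. This assumes `lambda` is the weight on the KNN distribution, as in the usual kNN-LM formulation, and that the neighbors vote uniformly; the PR summary confirms neither detail.

```python
from collections import Counter


def knn_distribution(neighbor_tokens, vocab_size):
    """Turn the tokens that followed the k nearest neighbors into a distribution.

    Assumption: each neighbor gets an equal vote.
    """
    counts = Counter(neighbor_tokens)
    k = len(neighbor_tokens)
    return [counts[t] / k for t in range(vocab_size)]


def mix_distributions(p_model, p_knn, lam=0.12):
    """Interpolate: (1 - lam) * model prediction + lam * KNN prediction."""
    return [(1.0 - lam) * pm + lam * pk for pm, pk in zip(p_model, p_knn)]
```

With `lam=0.12`, a token that every neighbor agrees on gains at most 0.12 probability mass, so the neural prediction stays dominant.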
Novel Contributions
- Eval-time KNN hidden state retrieval with score-first causal datastore updates
- Demonstration that KNN helps weak models but hurts strong competition-quality models
- Vectorized KNN evaluation using torch.cdist to fit within the 600s budget
- Evidence of scale deception / crossover behavior in eval-time augmentation
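A minimal sketch of the score-first, vectorized evaluation described in the list above, using `torch.cdist`. Several details are assumptions rather than facts from the PR: the function name, chunk-level updates, distance-softmax neighbor weights, and reading `subsample` as "store every 4th position".

```python
import torch


def knn_eval_chunk(hidden, logits, targets, store_keys, store_vals,
                   k=8, lam=0.12, subsample=4):
    """Score one chunk against the current datastore, then add the chunk to it.

    hidden: (T, D) hidden states; logits: (T, V); targets: (T,) token ids.
    store_keys: (N, D) float; store_vals: (N,) long.
    Returns (summed NLL in nats, new store_keys, new store_vals).
    """
    log_p = torch.log_softmax(logits.float(), dim=-1)
    if store_keys.shape[0] > 0:
        kk = min(k, store_keys.shape[0])
        dists = torch.cdist(hidden.float(), store_keys)           # (T, N)
        nn_dist, nn_idx = dists.topk(kk, dim=1, largest=False)    # (T, kk)
        weights = torch.softmax(-nn_dist, dim=1)                  # assumed weighting
        p_knn = torch.zeros_like(log_p)
        # Accumulate neighbor weights onto the tokens that followed them.
        p_knn.scatter_add_(1, store_vals[nn_idx], weights)
        log_p = ((1.0 - lam) * log_p.exp() + lam * p_knn).log()
    nll = -log_p.gather(1, targets.unsqueeze(1)).sum()
    # Score first, update second: the chunk enters the store only after scoring,
    # so no position is predicted with access to its own target.
    new_keys = torch.cat([store_keys, hidden[::subsample].float()])
    new_vals = torch.cat([store_vals, targets[::subsample]])
    return nll, new_keys, new_vals
```

Computing all query-to-key distances in one `cdist` call replaces a per-token Python loop, which is the kind of vectorization that matters under a fixed evaluation time budget.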