PR #1259

open

Non-record: KNN Hidden State Retrieval — Scale Deception from Weak to Strong Models (8xH100)

by himanshudongre

val_bpb

1.1533

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,826,144 bytes

Training Techniques

Architecture

BigramHash

Uses BigramHash as part of the merged leader model stack.

parameters: null

LeakyReLU

Uses a squared LeakyReLU activation variant in the model stack.

parameters: null

ReLU²

Uses a squared ReLU activation variant in the model stack.

parameters: null

XSA

Uses XSA-all as part of the merged leader architecture.

parameters: null

Weight Averaging

EMA

parameters: null

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null

Quantization

GPTQ

bits: 6

scope: all

Compression

lzma

level: 9
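
As a rough illustration of the compression setting above, the sketch below packs a byte payload with Python's standard lzma module at preset 9. It assumes that "level: 9" maps onto the lzma preset argument; the helper name compress_artifact and the dummy payload are hypothetical and are not taken from the PR.

```python
import lzma


def compress_artifact(raw: bytes) -> bytes:
    """Compress a serialized artifact with LZMA at the highest standard preset.

    Assumption: the PR's "level: 9" corresponds to preset=9 here.
    """
    return lzma.compress(raw, preset=9)


# Hypothetical stand-in for serialized, quantized weights.
payload = bytes(range(256)) * 64
packed = compress_artifact(payload)

# The round trip must be lossless, since the artifact is decompressed before use.
restored = lzma.decompress(packed)
```

The number that counts against the artifact budget would be the size of the compressed output, len(packed) in this sketch.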

Other

other

Eval-time KNN hidden state retrieval that stores hidden states and mixes nearest-neighbor token distributions with neural predictions.

parameters: {"k":8,"lambda":0.12,"subsample":4}
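
The mixing step described above can be sketched as a linear interpolation between the model's next-token distribution and a distribution built from the retrieved neighbors, using lambda = 0.12 from the parameters. The PR summary does not say how neighbors are weighted, so the softmax over negative distances below is an assumption borrowed from kNN-LM-style methods, and all function names are hypothetical.

```python
import math


def knn_token_distribution(neighbor_tokens, neighbor_dists, vocab_size):
    """Turn retrieved neighbors into a distribution over the vocabulary.

    Assumption: each neighbor is weighted by a softmax over negative
    distance, and neighbors that stored the same token pool their weight.
    """
    logits = [-d for d in neighbor_dists]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]  # shift for numerical stability
    total = sum(weights)
    probs = [0.0] * vocab_size
    for token, weight in zip(neighbor_tokens, weights):
        probs[token] += weight / total
    return probs


def mix(p_model, p_knn, lam=0.12):
    """Interpolate: (1 - lam) * model distribution + lam * neighbor distribution."""
    return [(1.0 - lam) * pm + lam * pk for pm, pk in zip(p_model, p_knn)]


# Toy example: a uniform model over 4 tokens and 3 equidistant neighbors.
p_model = [0.25, 0.25, 0.25, 0.25]
p_knn = knn_token_distribution([2, 2, 1], [0.0, 0.0, 0.0], vocab_size=4)
mixed = mix(p_model, p_knn)
```

With a small lambda the neural prediction dominates, so the retrieved neighbors can only shift a limited share of probability mass. The k and subsample parameters are not modelled in this sketch.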

Eval-time KNN hidden state retrieval with score-first causal datastore updates
Demonstration that KNN helps weak models but hurts strong competition-quality models
Vectorized KNN evaluation using torch.cdist to fit within the 600s budget
Evidence of scale deception / crossover behavior in eval-time augmentation
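
The score-first update and the torch.cdist vectorization listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the class name CausalKNNStore, the chunk-level update granularity, and the softmax distance weighting are all assumptions, and the subsample parameter is omitted.

```python
import torch


class CausalKNNStore:
    """Datastore of (hidden state, next token) pairs, queried before it is updated."""

    def __init__(self, dim, k=8, lam=0.12):
        self.keys = torch.empty(0, dim)                 # stored hidden states
        self.values = torch.empty(0, dtype=torch.long)  # tokens that followed them
        self.k, self.lam = k, lam

    def score_then_update(self, hidden, probs, targets):
        """hidden: (T, dim); probs: (T, V) model probabilities; targets: (T,)."""
        mixed = probs
        if self.keys.shape[0] > 0:
            k = min(self.k, self.keys.shape[0])
            # One batched distance computation for the whole chunk: (T, N).
            dists = torch.cdist(hidden, self.keys)
            nn_dist, nn_idx = dists.topk(k, dim=1, largest=False)
            weights = torch.softmax(-nn_dist, dim=1)    # assumed weighting
            knn = torch.zeros_like(probs)
            knn.scatter_add_(1, self.values[nn_idx], weights)
            mixed = (1.0 - self.lam) * probs + self.lam * knn
        # Score first: the loss uses only entries stored by earlier chunks.
        nll = -torch.log(mixed.gather(1, targets[:, None]).squeeze(1))
        # Update second: the current targets become visible to later chunks only.
        self.keys = torch.cat([self.keys, hidden])
        self.values = torch.cat([self.values, targets])
        return nll


store = CausalKNNStore(dim=2)
uniform = torch.full((2, 3), 1.0 / 3.0)
first = store.score_then_update(
    torch.tensor([[0.0, 0.0], [1.0, 1.0]]), uniform, torch.tensor([1, 2])
)
second = store.score_then_update(
    torch.tensor([[0.0, 0.0]]), uniform[:1], torch.tensor([1])
)
```

Because each chunk is scored before its own targets enter the datastore, the retrieval never sees the token it is being evaluated on. In this toy run the second query repeats a stored hidden state, so its loss falls below the uniform baseline.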