PR #1149 (open)
Add non-record submission: faithful KV-cache quantization backends on 1x RTX 3090
by LucasErcolano
val_bpb
1.6507
Architecture
Transformer
Optimizer
—
Artifact Size
10,458,900 bytes
Training Techniques
Quantization
int8
bits: 8
scope: model artifact export
Compression
zlib
level: null
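The artifact export combines int8 quantization with zlib compression at the default level (the card's "level: null"). A minimal sketch of that pipeline, assuming per-tensor symmetric int8 scales; the function name and serialization layout are illustrative, not the PR's actual export code:

```python
import io
import zlib

import numpy as np

def export_artifact(tensors: dict) -> bytes:
    """Hypothetical export: symmetric int8 quantization of each tensor,
    then zlib compression of the packed bytes."""
    buf = io.BytesIO()
    for name, w in tensors.items():
        # Per-tensor symmetric scale mapping the max magnitude to 127.
        scale = np.abs(w).max() / 127.0 + 1e-12
        q = np.round(w / scale).astype(np.int8)
        np.save(buf, np.float32(scale))
        np.save(buf, q)
    # level omitted -> zlib's default compression level ("level: null").
    return zlib.compress(buf.getvalue())
```

Decompressing and dequantizing recovers each tensor to within half a quantization step.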
Evaluation
autoregressive KV-cache eval
parameters: {"context_length":1024,"max_tokens":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Other
other
Teacher-forced autoregressive evaluation path with explicit per-layer KV cache updates and selectable KV-cache quantization backends (none, qjl, polar, turbo).
parameters: {"backends":["none","qjl","polar","turbo"]}
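Of the listed backends, the QJL-style one can be sketched compactly: keys are reduced to the sign bits of a random Gaussian projection plus their norm, and query-key scores are recovered with an unbiased inner-product estimator. A minimal numpy sketch under those assumptions; function names are hypothetical and this is not the PR's implementation:

```python
import numpy as np

def qjl_encode(k, S):
    """QJL-style key compression: keep only the sign bits of the random
    projection S @ k plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, sign_bits, k_norm, S):
    """Estimate <q, k> from the stored sign bits:
    <q, k> ~ ||k|| * sqrt(pi/2) / m * sum_i sign(<s_i, k>) * <s_i, q>."""
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2.0) / m * float(sign_bits @ (S @ q))
```

With enough projection rows the estimate concentrates around the true score while storing one bit per row instead of a full float per key dimension.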
Novel Contributions
- Teacher-forced autoregressive KV-cache evaluation path with explicit per-layer cache updates
- Paper-inspired KV-cache quantization backends: QJL-style, PolarQuant-style, and TurboQuant-style
- Pure PyTorch KV evaluator without Triton/custom CUDA kernels
- Evaluation of multiple KV-cache backends on the same checkpoint with capped validation tokens
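The evaluation path above (teacher-forced steps, explicit per-layer cache updates, pluggable quantization backend) can be sketched for a single layer as follows. This is a minimal numpy illustration, not the PR's PyTorch code; the "int8" backend and all names here are hypothetical stand-ins for the real qjl/polar/turbo backends:

```python
import numpy as np

def quantize_int8(x):
    # Illustrative per-token symmetric int8 quantizer for cached K/V.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

# backend name -> (encode, decode); "none" stores the cache in full precision.
BACKENDS = {
    "none": (lambda x: (x, None), lambda q, s: q),
    "int8": (quantize_int8, dequantize_int8),
}

def attend_with_cache(q, new_k, new_v, cache, backend="none"):
    """One teacher-forced step: push the new K/V through the selected
    quantization backend into the per-layer cache, then attend over it."""
    enc, dec = BACKENDS[backend]
    cache.append((enc(new_k), enc(new_v)))
    K = np.stack([dec(*kq) for kq, _ in cache])   # (T, d), dequantized
    V = np.stack([dec(*vq) for _, vq in cache])
    scores = K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V
```

Running the same checkpoint's K/V stream through different backends (as the PR does with capped validation tokens) then reduces to swapping the backend string while keeping the loop identical.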