PR #1149 (open)
Add non-record submission: faithful KV-cache quantization backends on 1x RTX 3090
by LucasErcolano
val_bpb
1.6507
Architecture
Transformer
Optimizer
—
Artifact Size
10,458,900 bytes
Training Techniques
Quantization
int8
bits: 8
scope: model artifact export
Compression
zlib
level: null
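The artifact export combines int8 quantization with zlib compression at the default level (the card's "level: null"). A minimal sketch of that pipeline, assuming per-tensor symmetric int8 scales; the function name and serialization layout are illustrative, not the PR's actual export code:

```python
import io
import zlib

import numpy as np

def export_artifact(tensors: dict) -> bytes:
    """Hypothetical export: symmetric int8 quantization of each tensor,
    then zlib compression of the packed bytes."""
    buf = io.BytesIO()
    for name, w in tensors.items():
        # Per-tensor symmetric scale mapping the max magnitude to 127.
        scale = np.abs(w).max() / 127.0 + 1e-12
        q = np.round(w / scale).astype(np.int8)
        np.save(buf, np.float32(scale))
        np.save(buf, q)
    # level omitted -> zlib's default compression level ("level: null").
    return zlib.compress(buf.getvalue())
```

Decompressing and dequantizing recovers each tensor to within half a quantization step.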
Evaluation
autoregressive KV-cache eval
parameters: {"context_length":1024,"max_tokens":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Other
other
Teacher-forced autoregressive evaluation path with explicit per-layer KV cache updates and selectable KV-cache quantization backends (none, qjl, polar, turbo).
parameters: {"backends":["none","qjl","polar","turbo"]}
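Of the listed backends, the QJL-style one can be sketched compactly: keys are reduced to the sign bits of a random Gaussian projection plus their norm, and query-key scores are recovered with an unbiased inner-product estimator. A minimal numpy sketch under those assumptions; function names are hypothetical and this is not the PR's implementation:

```python
import numpy as np

def qjl_encode(k, S):
    """QJL-style key compression: keep only the sign bits of the random
    projection S @ k plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_score(q, sign_bits, k_norm, S):
    """Estimate <q, k> from the stored sign bits:
    <q, k> ~ ||k|| * sqrt(pi/2) / m * sum_i sign(<s_i, k>) * <s_i, q>."""
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2.0) / m * float(sign_bits @ (S @ q))
```

With enough projection rows the estimate concentrates around the true score while storing one bit per row instead of a full float per key dimension.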
Novel Contributions
- Teacher-forced autoregressive KV-cache evaluation path with explicit per-layer cache updates
- Paper-inspired KV-cache quantization backends: QJL-style, PolarQuant-style, and TurboQuant-style
- Pure PyTorch KV evaluator without Triton/custom CUDA kernels
- Evaluation of multiple KV-cache backends on the same checkpoint with capped validation tokens
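The evaluation path above (teacher-forced steps, explicit per-layer cache updates, pluggable quantization backend) can be sketched for a single layer as follows. This is a minimal numpy illustration, not the PR's PyTorch code; the "int8" backend and all names here are hypothetical stand-ins for the real qjl/polar/turbo backends:

```python
import numpy as np

def quantize_int8(x):
    # Illustrative per-token symmetric int8 quantizer for cached K/V.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

# backend name -> (encode, decode); "none" stores the cache in full precision.
BACKENDS = {
    "none": (lambda x: (x, None), lambda q, s: q),
    "int8": (quantize_int8, dequantize_int8),
}

def attend_with_cache(q, new_k, new_v, cache, backend="none"):
    """One teacher-forced step: push the new K/V through the selected
    quantization backend into the per-layer cache, then attend over it."""
    enc, dec = BACKENDS[backend]
    cache.append((enc(new_k), enc(new_v)))
    K = np.stack([dec(*kq) for kq, _ in cache])   # (T, d), dequantized
    V = np.stack([dec(*vq) for _, vq in cache])
    scores = K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V
```

Running the same checkpoint's K/V stream through different backends (as the PR does with capped validation tokens) then reduces to swapping the backend string while keeping the loop identical.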