PR #1153

open

Non-record: Triton KV-cache backend for autoregressive eval

by LucasErcolano
val_bpb
1.6507
Architecture
Transformer
Optimizer
Artifact Size
10,458,900 bytes

Training Techniques

Evaluation
autoregressive eval
parameters: {"context_length":1024,"max_tokens":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Other
other
Triton-backed KV-cache evaluation backends: fused grouped-int8 score/apply kernels, and a fused QJL sign-score kernel plus a grouped value-apply kernel
parameters: {"backends":["int8_triton","qjl_triton"]}
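The grouped-int8 backend presumably quantizes cached keys with one shared scale per small group of elements, then dequantizes on the fly when computing attention scores. A minimal pure-Python sketch of that scheme (the group size, dimensions, and function names are illustrative, not the PR's actual kernel parameters):

```python
import random

GROUP = 32  # elements sharing one int8 scale (illustrative choice)

def quantize_grouped_int8(vec):
    """Quantize a float vector to int8 with one scale per group."""
    groups, scales = [], []
    for start in range(0, len(vec), GROUP):
        chunk = vec[start:start + GROUP]
        scale = max(abs(x) for x in chunk) / 127 or 1.0
        scales.append(scale)
        groups.append([round(x / scale) for x in chunk])
    return groups, scales

def score_int8(query, groups, scales):
    """Attention-style score: dot(query, dequantized key)."""
    total, i = 0.0, 0
    for chunk, scale in zip(groups, scales):
        for kv in chunk:
            total += query[i] * kv * scale
            i += 1
    return total

random.seed(0)
dim = 64
key = [random.uniform(-1, 1) for _ in range(dim)]
query = [random.uniform(-1, 1) for _ in range(dim)]

groups, scales = quantize_grouped_int8(key)
exact = sum(q * k for q, k in zip(query, key))
approx = score_int8(query, groups, scales)
```

A fused kernel would do the dequantize-and-accumulate in one pass over the cache rather than materializing dequantized keys, which is the point of doing this in Triton.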
other
Prewarm logic to avoid first-use Triton JIT compile overhead during short evaluations
parameters: null
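Prewarming here presumably means invoking each Triton kernel once on dummy inputs before the timed region, so one-time JIT compilation is not charged to a short evaluation. A toy sketch of the pattern (the `toy_jit` decorator is a stand-in for a JIT cache, not Triton's API):

```python
import functools

compile_count = 0  # how many times "JIT compilation" ran

def toy_jit(fn):
    """Stand-in for a JIT: compiles on first call, then serves from cache."""
    compiled = {}
    @functools.wraps(fn)
    def wrapper(*args):
        global compile_count
        if fn not in compiled:
            compile_count += 1  # the expensive compile happens here, once
            compiled[fn] = fn
        return compiled[fn](*args)
    return wrapper

@toy_jit
def score_kernel(q, k):
    return sum(a * b for a, b in zip(q, k))

def prewarm():
    # One dummy call triggers compilation outside the timed region.
    score_kernel([0.0], [0.0])

prewarm()
compiles_before_timing = compile_count

# "Timed" evaluation loop: no compiles should occur in here.
result = 0.0
for _ in range(100):
    result = score_kernel([1.0, 2.0], [3.0, 4.0])
```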
other
Benchmark-side peak CUDA memory reporting to sanity-check allocator behavior
parameters: null
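On the CUDA side this likely wraps the benchmark in `torch.cuda.reset_peak_memory_stats()` / `torch.cuda.max_memory_allocated()` calls. The same reset-then-report pattern can be sketched with Python's stdlib `tracemalloc` as a CPU stand-in (the workload below is a placeholder, not the PR's benchmark):

```python
import tracemalloc

def run_benchmark():
    # Placeholder workload; the real benchmark would run KV-cache decoding.
    buf = bytearray(8_000_000)  # ~8 MB transient allocation
    return len(buf)

tracemalloc.start()
tracemalloc.reset_peak()  # analogous to torch.cuda.reset_peak_memory_stats()
tokens = run_benchmark()
_, peak = tracemalloc.get_traced_memory()  # analogous to torch.cuda.max_memory_allocated()
tracemalloc.stop()

print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Reporting the peak rather than the final allocation is what makes this a sanity check on the allocator: a backend that briefly doubles the cache during quantization shows up in the peak even if steady-state usage looks fine.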

Novel Contributions

  • Added Triton-backed KV-cache evaluation paths for autoregressive evaluation
  • Implemented fused grouped-int8 score/apply kernels
  • Implemented fused QJL sign-score kernel plus grouped value apply kernel
  • Added backend selftests and a dedicated GPU benchmark script
  • Added prewarm logic to reduce Triton JIT compile overhead on short evals
  • Added peak CUDA memory reporting in the benchmark