PR #1153
Non-record: Triton KV-cache backend for autoregressive eval (open)
by LucasErcolano
val_bpb
1.6507
Architecture
Transformer
Optimizer
—
Artifact Size
10,458,900 bytes
Training Techniques
Evaluation
autoregressive eval
parameters: {"context_length":1024,"max_tokens":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Other
other
Triton-backed KV-cache evaluation backends: fused grouped-int8 score/apply kernels, and a fused QJL sign-score kernel plus a grouped value-apply kernel
parameters: {"backends":["int8_triton","qjl_triton"]}
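The core numerics behind a grouped-int8 KV cache can be sketched without Triton: each group of the key/value feature dimension gets one scale, values are rounded to int8, and scores are computed against the dequantized cache. The sketch below is a NumPy illustration under assumed conventions (group size, symmetric per-group scaling); the PR's actual kernels fuse quantized score/apply on GPU, and the function names here are illustrative, not from the PR.

```python
import numpy as np

def quantize_grouped_int8(x, group_size=64):
    """Quantize along the last axis in groups of `group_size`,
    with one symmetric scale per group (illustrative sketch)."""
    *lead, d = x.shape
    assert d % group_size == 0
    g = x.reshape(*lead, d // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero groups
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_grouped_int8(q, scale):
    """Inverse of the quantizer: rescale and flatten groups back."""
    g = q.astype(np.float32) * scale
    return g.reshape(*q.shape[:-2], -1)

rng = np.random.default_rng(0)
k = rng.standard_normal((4, 128)).astype(np.float32)  # toy key cache
q8, s = quantize_grouped_int8(k)
k_hat = dequantize_grouped_int8(q8, s)
err = float(np.abs(k - k_hat).max())  # bounded by half a quantization step
```

A fused kernel avoids materializing `k_hat` at all: the dequantize folds into the score (QK^T) and value-apply steps, which is what keeps the int8 path memory-bound rather than compute-bound.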
other
Prewarm logic to avoid first-use Triton JIT compile overhead during short evaluations
parameters: null
other
Benchmark-side peak CUDA memory reporting to sanity-check allocator behavior
parameters: null
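The reporting pattern is reset peak stats, run the workload, read the peak back. On GPU this would use `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`; the sketch below uses the standard-library `tracemalloc` as a CPU stand-in to show the same reset/run/report shape (the helper name is illustrative, not from the PR):

```python
import tracemalloc

def run_with_peak_report(fn, *args):
    """Run `fn` and report peak traced memory. CUDA analog: call
    torch.cuda.reset_peak_memory_stats() before and
    torch.cuda.max_memory_allocated() after the run."""
    tracemalloc.start()
    tracemalloc.reset_peak()
    result = fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
    return result, peak

def workload(n):
    buf = [0.0] * n             # allocates roughly n * 8 bytes of pointers
    return sum(buf)

result, peak = run_with_peak_report(workload, 1_000_000)
```

A peak figure that is far above what the KV cache's nominal size predicts is the allocator-behavior red flag this check is meant to surface (fragmentation, unexpected temporaries, or a dequantized copy being materialized).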
Novel Contributions
- Added Triton-backed KV-cache evaluation paths for autoregressive evaluation
- Implemented fused grouped-int8 score/apply kernels
- Implemented fused QJL sign-score kernel plus grouped value apply kernel
- Added backend selftests and a dedicated GPU benchmark script
- Added prewarm logic to reduce Triton JIT compile overhead on short evals
- Added peak CUDA memory reporting in the benchmark