Overview
The paper includes a public implementation, GPU kernels, and multi-dataset evaluations; results are promising but primarily shown on a few 7B-class models and GPUs.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.
Summary TLDR
Quest is a query-aware KV-cache selection algorithm for long-context LLM inference. It stores per-page element-wise Key min/max vectors as small metadata and, at each decode step, scores every page by an upper bound on the dot product between the current Query and any Key in that page. Quest then loads only the top-K pages for attention. At 32K sequence length with a token budget of 2048, Quest cuts self-attention memory movement and achieves up to 7.03× self-attention speedup and 2.23× end-to-end decoding speedup (the latter with 4-bit weights), with negligible accuracy loss on long-context benchmarks. Code is public.
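The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's GPU kernels: the function names (`build_page_metadata`, `page_upper_bound`, `select_pages`) and the toy keys are assumptions made for this example. It builds per-page min/max vectors, scores each page by the per-channel upper bound on the Query-Key dot product, and keeps the top-K pages.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def build_page_metadata(keys: List[Vector], page_size: int) -> List[Tuple[list, list]]:
    """Split keys into pages and return element-wise (min_vec, max_vec) per page."""
    metadata = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        channels = list(zip(*page))  # transpose: one tuple of values per channel
        metadata.append(([min(c) for c in channels], [max(c) for c in channels]))
    return metadata


def page_upper_bound(query: Vector, min_vec: Vector, max_vec: Vector) -> float:
    """Upper bound on the dot product of `query` with any key in the page.

    Per channel, the largest possible product is reached at either the
    page minimum or the page maximum, depending on the sign of the query.
    """
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, min_vec, max_vec))


def select_pages(query: Vector, metadata, top_k: int) -> List[int]:
    """Return the indices of the top_k pages ranked by upper-bound score."""
    scores = [page_upper_bound(query, lo, hi) for lo, hi in metadata]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy example (hypothetical data): six 2-dimensional keys, two keys per page.
toy_keys = [[1, 0], [0, 1], [-1, 0], [0, -1], [2, 2], [3, -1]]
toy_metadata = build_page_metadata(toy_keys, page_size=2)

if __name__ == "__main__":
    print(select_pages([1, 1], toy_metadata, top_k=2))    # [0, 2]
    print(select_pages([-1, -1], toy_metadata, top_k=1))  # [1]
```

The two calls pick different pages for different queries, which is the query-aware behavior the summary describes. The metadata is built once per page, so only the scoring and ranking depend on the current Query.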
Problem Statement
Long LLM contexts (tens of thousands of tokens) slow decode-stage inference because the full KV cache must be loaded for each token. Prior pruning methods drop tokens based on history and can miss tokens that become critical for future queries. We need a fast way to pick which KV cache parts to load per query without discarding the cache.
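To make the cost concrete, the following back-of-the-envelope sketch estimates how many bytes of KV cache a single decode step must read when the full cache is loaded. The model shape used here (32 layers, 32 KV heads, head dimension 128, FP16) is an assumption typical of a 7B-class model without grouped-query attention, not a figure taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of Keys plus Values stored (and read per decode step) for one sequence."""
    keys_and_values = 2  # one Key vector and one Value vector per token
    return keys_and_values * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Assumed 7B-class shape; adjust to your model.
    total = kv_cache_bytes(seq_len=32 * 1024, n_layers=32, n_kv_heads=32,
                           head_dim=128, bytes_per_elem=2)
    print(f"{total / 2**30:.1f} GiB read per generated token")  # 16.0 GiB
```

Under these assumptions every generated token requires streaming about 16 GiB of KV cache from GPU memory, which is why decode at long context tends to be bound by memory bandwidth rather than compute.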
Main Contribution
Show that which KV tokens matter depends strongly on the current Query vector, motivating query-aware selection.
Introduce Quest: a page-level criticality estimator that uses per-page Key min/max metadata and the current Query to score pages cheaply.
Key Findings
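Quest achieves up to 7.03× self-attention speedup over FlashInfer (32K sequence length, token budget 2048) by loading only the top-K pages instead of the full KV cache.
Combined with 4-bit weight quantization, Quest achieves up to 2.23× end-to-end decode speedup over the FlashInfer full-KV-cache baseline in the same setting.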
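A rough way to see where the self-attention savings come from is to count the vectors read per head at each decode step: the min and max Key vectors for every page, plus the Keys and Values of the selected tokens, compared with the Keys and Values of every token. The sketch below does that count. The helper name `quest_load_fraction` and the page size of 16 are assumptions for illustration; the result is an estimate of memory traffic, not a latency prediction.

```python
def quest_load_fraction(seq_len: int, token_budget: int, page_size: int) -> float:
    """Fraction of full-KV-cache traffic read when only top-K pages are loaded."""
    num_pages = -(-seq_len // page_size)            # ceiling division
    full = 2 * seq_len                              # K and V for every token
    metadata = 2 * num_pages                        # one min and one max Key vector per page
    selected = 2 * min(token_budget, seq_len)       # K and V for the selected tokens
    return (metadata + selected) / full


if __name__ == "__main__":
    # 32K context, token budget 2048, assumed page size 16.
    print(quest_load_fraction(32 * 1024, 2048, 16))  # 0.125
```

With these numbers the estimate is 1/16 for metadata plus 1/16 for the selected tokens, so about one eighth of the full-cache traffic. Smaller pages increase the metadata share, and larger budgets increase the selected share.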
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Self-attention latency reduction | 7.03× | FlashInfer | 7.03× faster | 32K sequence length, token budget 2048 | Measured kernel-level reduction; Fig.9 | Sec.4.3.1 / Fig.9 |
| End-to-end decode latency | 2.23× speedup | FlashInfer (full KV cache) with FP16 | 2.23× faster | 32K sequence length, token budget 2048, 4-bit weights | Measured single-batch text generation latency; Fig.10 | Sec.4.3.2 / Fig.10 |
What To Try In 7 Days
Clone the Quest repo and reproduce the kernel tests on a small long-context model with sample inputs.
Run passkey retrieval or NarrativeQA with and without Quest to measure accuracy vs token budget.
Tune page size and Top-K token budget to find the accuracy/speed sweet spot for your workload.
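Before running real benchmarks, a small offline sweep can help narrow the page-size and token-budget grid. The sketch below is a hypothetical harness, not part of the Quest repo: it draws synthetic Gaussian keys and one query, selects pages by the min/max upper bound, and reports what fraction of the tokens with the highest exact dot products land inside the selected pages. Synthetic keys and this recall proxy are both assumptions, so real key distributions and task accuracy may behave differently.

```python
import random


def page_bounds(keys, page_size):
    """Element-wise (min_vec, max_vec) for each page of keys."""
    pages = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]
    return [([min(c) for c in zip(*p)], [max(c) for c in zip(*p)]) for p in pages]


def selected_tokens(query, keys, page_size, token_budget):
    """Token indices covered by the top-K pages, where K = token_budget // page_size."""
    scores = [sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mn, mx))
              for mn, mx in page_bounds(keys, page_size)]
    top_k = max(1, token_budget // page_size)
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return {t for p in best
            for t in range(p * page_size, min((p + 1) * page_size, len(keys)))}


def recall_of_top_tokens(query, keys, page_size, token_budget, n_critical=8):
    """Share of the n_critical highest-scoring tokens that the selection keeps."""
    exact = [sum(q * k for q, k in zip(query, key)) for key in keys]
    critical = sorted(range(len(keys)), key=lambda i: exact[i], reverse=True)[:n_critical]
    chosen = selected_tokens(query, keys, page_size, token_budget)
    return sum(t in chosen for t in critical) / len(critical)


def sweep(seq_len=512, dim=16, page_sizes=(8, 16, 32), budgets=(64, 128, 256), seed=0):
    """Recall for every (page_size, token_budget) pair on one synthetic sample."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    return {(ps, b): recall_of_top_tokens(query, keys, ps, b)
            for ps in page_sizes for b in budgets}


if __name__ == "__main__":
    for (ps, budget), recall in sorted(sweep().items()):
        print(f"page_size={ps:3d} token_budget={budget:4d} recall={recall:.2f}")
```

For a real workload, the synthetic keys and query would be replaced with tensors captured from the model, and the recall proxy with the task metric (for example passkey retrieval accuracy).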
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Quest is not applied to the first two model layers because they show low sparsity.
Requires tuning of page size and Top-K token budget per model and workload.
When Not To Use
Short-context scenarios where the full KV cache fits in fast memory.
Workloads where early layers are critical and cannot be skipped by page filtering.
Failure Modes
Choosing Top-K too small can miss critical tokens and drop accuracy.
Per-page min/max metadata may give loose upper bounds and lead to unnecessary page loads or missed tokens.
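The loose-bound failure mode can be reproduced with a tiny hand-built example. In the sketch below (hypothetical data and helper name `bound_and_true_max`), one page mixes keys that point in opposite directions, so its min/max bound is high even though no single key in it matches the query, while another page holds the genuinely best key but has a lower bound.

```python
def bound_and_true_max(query, page_keys):
    """Return (min/max upper bound, exact best dot product) for one page."""
    channels = list(zip(*page_keys))  # one tuple of values per channel
    bound = sum(max(q * min(c), q * max(c)) for q, c in zip(query, channels))
    true_max = max(sum(q * k for q, k in zip(query, key)) for key in page_keys)
    return bound, true_max


query = [1.0, 1.0]
loose_page = [[4.0, -4.0], [-4.0, 4.0]]  # per-channel extremes never occur in the same key
tight_page = [[3.0, 3.0], [2.0, 2.0]]    # contains the key that best matches the query

if __name__ == "__main__":
    print(bound_and_true_max(query, loose_page))  # (8.0, 0.0)
    print(bound_and_true_max(query, tight_page))  # (6.0, 6.0)
```

With a budget of a single page, ranking by the bound would load the loose page (8.0 beats 6.0) and miss the best key in the tight page. Larger token budgets or smaller pages reduce how often this happens, at the cost of more memory traffic.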

