Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.
Summary TLDR
Quest is a query-aware KV-cache selection algorithm for long-context LLM inference. It stores per-page Key min/max vectors as tiny metadata and, at each decode step, scores pages by how large their worst-case dot-products with the current Query can be. Quest then loads only the top-K pages for attention. On 32K contexts and typical models, Quest cuts self-attention memory movement and achieves up to 7.03× self-attention speedup and 2.23× end-to-end decoding speedup with negligible accuracy loss on long-context benchmarks. Code is public.
Problem Statement
Long LLM contexts (tens of thousands of tokens) slow decode-stage inference because the full KV cache must be loaded for each token. Prior pruning methods drop tokens based on history and can miss tokens that become critical for future queries. We need a fast way to pick which KV cache parts to load per query without discarding the cache.
Main Contribution
Show that which KV tokens matter depends strongly on the current Query vector, motivating query-aware selection.
Introduce Quest: a page-level criticality estimator that uses per-page Key min/max metadata and the current Query to score pages cheaply.
Provide CUDA kernels and FlashInfer integration and show up to 7.03× self-attention and 2.23× end-to-end decode speedups with near-lossless accuracy on long-context benchmarks.
Key Findings
Quest achieves large self-attention speedups by loading only top-K pages instead of the full KV cache.
Quest reduces end-to-end decode latency when combined with weight quantization.
Quest preserves long-dependency accuracy with very small token budgets on retrieval-style tasks.
Per-layer sparsity is high except in early layers.
Quest's page selection approximates oracle top-token choices closely, maintaining high recall of top attention tokens.
Memory movement can be reduced substantially; example parameterization gives an 8× reduction.
Results
Self-attention latency reduction
End-to-end decode latency
Accuracy
Perplexity (language modeling)
Who Should Care
What To Try In 7 Days
Clone Quest repo and reproduce kernel tests on a small long-context model and sample inputs.
Run passkey retrieval or NarrativeQA with and without Quest to measure accuracy vs token budget.
Tune page size and Top-K token budget to find the accuracy/speed sweet spot for your workload.
Optimization Features
Token Efficiency
- Token budget concept (load only tokens in Top-K pages)
- Accuracy
Infra Optimization
- Reduces memory movement and GPU bandwidth pressure
- Compatible with low-bit weight quantization (4-bit)
System Optimization
- Dedicated CUDA kernels and FlashInfer integration
- Batched Top-K via RAFT
Inference Optimization
- Query-aware KV cache page selection
- Per-page Key min/max upper-bound scoring
- Top-K page loading for sparse self-attention
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Quest is not applied to the first two model layers because they show low sparsity.
- Requires tuning of page size and Top-K token budget per model and workload.
- Current implementation targets CUDA/FlashInfer; portability to other runtimes is untested in paper.
- The per-page min/max heuristic may misestimate pages when Key distributions are complex.
When Not To Use
- Short-context scenarios where the full KV cache fits in fast memory.
- Workloads where early layers are critical and cannot be skipped by page filtering.
- Environments without GPU/CUDA support or where FlashInfer cannot be used.
Failure Modes
- Choosing Top-K too small can miss critical tokens and drop accuracy.
- Per-page min/max metadata may give loose upper bounds and lead to unnecessary page loads or missed tokens.
- Criticality estimation overhead can dominate at small sequence lengths, reducing benefits.
Core Entities
Models
- LongChat-7b-v1.5-32k
- Yarn-Llama-2-7b-128k
- Llama2-7B
- Llama-7B
Metrics
- self-attention latency reduction
- end-to-end decode latency
- Accuracy
- perplexity
- top-token recall rate
Datasets
- PG19
- LongBench (NarrativeQA, HotpotQA, Qasper, TriviaQA, GovReport, MultifieldQA)
- passkey retrieval (Yarn)
Benchmarks
- LongBench
- passkey retrieval

