Quest speeds long-context LLM decoding by loading only the KV cache pages likely relevant to the current query

June 16, 2024 | 8 min

Overview

Decision Snapshot: Needs Validation

The paper includes a public implementation, GPU kernels, and multi-dataset evaluations; results are promising but primarily shown on a few 7B-class models and GPUs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Links

Abstract / PDF / Code

Why It Matters For Business

Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.

Who Should Care

Summary TLDR

Quest is a query-aware KV-cache selection algorithm for long-context LLM inference. It stores per-page Key min/max vectors as lightweight metadata and, at each decode step, scores each page by an upper bound on the dot product between the current Query and any Key in that page. Quest then loads only the Top-K pages for attention. At 32K sequence length with a token budget of 2048, Quest cuts self-attention memory movement and achieves up to 7.03× self-attention speedup and 2.23× end-to-end decoding speedup (the latter with 4-bit weights), with negligible accuracy loss on long-context benchmarks. Code is public.

Problem Statement

Long LLM contexts (tens of thousands of tokens) slow decode-stage inference because the full KV cache must be loaded for each token. Prior pruning methods drop tokens based on history and can miss tokens that become critical for future queries. We need a fast way to pick which KV cache parts to load per query without discarding the cache.
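To make the cost concrete, here is a back-of-the-envelope sketch of how many KV-cache bytes one decode step reads with and without page selection. The model shape (32 layers, 32 KV heads, head dimension 128, FP16) and the page size of 16 are assumptions chosen for illustration, and the function names are hypothetical. The estimate also ignores any layers where selection is not applied, so it is rough arithmetic, not a measurement.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K plus V stored for one context token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def full_read_bytes(context_len, per_token):
    """Dense attention reads every cached K and V once per decoded token."""
    return context_len * per_token


def page_selected_read_bytes(context_len, token_budget, page_size, per_token):
    """Reads per decoded token when only the pages within the budget are loaded."""
    num_pages = -(-context_len // page_size)  # ceiling division
    # Each page keeps a min and a max key vector: two vectors per head per
    # layer, the same size as the K+V pair of a single token.
    metadata = num_pages * per_token
    selected = min(token_budget, context_len) * per_token
    return metadata + selected


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128)
full = full_read_bytes(32 * 1024, per_token)
sparse = page_selected_read_bytes(32 * 1024, 2048, 16, per_token)
print(f"{full / 2**30:.1f} GiB full, {sparse / 2**30:.1f} GiB with page selection")
# prints "16.0 GiB full, 2.0 GiB with page selection"
```

Under these assumptions the metadata scan costs about as much as the selected pages themselves, which is one reason page size matters when tuning.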

Main Contribution

Show that which KV tokens matter depends strongly on the current Query vector, motivating query-aware selection.

Introduce Quest: a page-level criticality estimator that uses per-page Key min/max metadata and the current Query to score pages cheaply.
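The estimator described above can be sketched in NumPy for a single attention head. This is a minimal illustration under stated assumptions, not the repository's CUDA implementation: the function names, the padding strategy, and the random test data are all hypothetical.

```python
import numpy as np


def build_page_metadata(keys, page_size):
    """keys: (seq_len, head_dim). Returns per-page elementwise (min, max)."""
    seq_len, head_dim = keys.shape
    num_pages = -(-seq_len // page_size)  # ceiling division
    pad = num_pages * page_size - seq_len
    # Repeat the last key as padding so the final page's min/max is unchanged.
    padded = np.pad(keys, ((0, pad), (0, 0)), mode="edge")
    pages = padded.reshape(num_pages, page_size, head_dim)
    return pages.min(axis=1), pages.max(axis=1)


def page_upper_bounds(query, key_min, key_max):
    """Upper bound on q.k over every key in each page.

    Per channel i, q_i * k_i is at most max(q_i * min_i, q_i * max_i);
    summing the per-channel maxima bounds the whole dot product.
    """
    return np.maximum(query * key_min, query * key_max).sum(axis=1)


def select_pages(query, key_min, key_max, token_budget, page_size):
    """Indices of the Top-K pages, where K = token_budget // page_size."""
    scores = page_upper_bounds(query, key_min, key_max)
    k = min(scores.shape[0], max(1, token_budget // page_size))
    top = np.argpartition(-scores, k - 1)[:k]
    return np.sort(top)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

key_min, key_max = build_page_metadata(keys, page_size=16)
bounds = page_upper_bounds(query, key_min, key_max)
selected = select_pages(query, key_min, key_max, token_budget=128, page_size=16)
print(selected)
```

Only `key_min` and `key_max` are touched during scoring, so the full Key and Value tensors need to be read only for the pages in `selected`.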

Key Findings

Quest achieves large self-attention speedups by loading only top-K pages instead of the full KV cache.

Numbers: 7.03× self-attention speedup at 32K seq, token budget 2048

Practical Use: For long-context decoding you can reduce attention time by several× by switching to query-aware page selection and loading only Top-K pages.

Evidence Ref: Fig. 9 and Sec. 4.3.1

Quest reduces end-to-end decode latency when combined with weight quantization.

Numbers: 2.23× end-to-end inference speedup with 4-bit weights at 32K seq, budget 2048

Practical Use: Combine Quest with low-bit weight quantization for a practical end-to-end latency win in production.

Evidence Ref: Fig. 10 and Sec. 4.3.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Self-attention latency reduction | 7.03× | FlashInfer | 7.03× faster | 32K sequence length, token budget 2048 | Measured kernel-level reduction; Fig. 9 | Sec. 4.3.1 / Fig. 9
End-to-end decode latency | 2.23× speedup | FlashInfer (full KV cache) with FP16 | 2.23× faster | 32K sequence length, token budget 2048, 4-bit weights | Measured single-batch text generation latency; Fig. 10 | Sec. 4.3.2 / Fig. 10

What To Try In 7 Days

Clone Quest repo and reproduce kernel tests on a small long-context model and sample inputs.

Run passkey retrieval or NarrativeQA with and without Quest to measure accuracy vs token budget.

Tune page size and Top-K token budget to find the accuracy/speed sweet spot for your workload.
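One way to structure the tuning step is a small sweep that reports top-token recall (the fraction of the highest-scoring tokens that land inside the selected pages) for each page size and budget. The sketch below uses random Gaussian keys purely as a stand-in; for real tuning you would substitute Query and Key tensors captured from your own model. The helper names and the `top_n=32` default are assumptions.

```python
import numpy as np


def page_bounds(query, keys, page_size):
    """Per-page upper bound on q.k (seq_len must be a multiple of page_size)."""
    pages = keys.reshape(-1, page_size, keys.shape[1])
    lo, hi = pages.min(axis=1), pages.max(axis=1)
    return np.maximum(query * lo, query * hi).sum(axis=1)


def top_token_recall(query, keys, page_size, token_budget, top_n=32):
    """Fraction of the top_n true-score tokens covered by the selected pages."""
    num_pages = keys.shape[0] // page_size
    k = min(num_pages, max(1, token_budget // page_size))
    chosen_pages = np.argsort(-page_bounds(query, keys, page_size))[:k]
    true_top = np.argsort(-(keys @ query))[:top_n]
    return float(np.isin(true_top // page_size, chosen_pages).mean())


# Synthetic stand-in data; replace with real Q/K tensors for actual tuning.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 64))
query = rng.standard_normal(64)

results = {}
for page_size in (16, 32):
    for budget in (256, 1024, 4096):
        results[(page_size, budget)] = top_token_recall(query, keys, page_size, budget)
        print(f"page_size={page_size} budget={budget} "
              f"recall={results[(page_size, budget)]:.2f}")
```

Recall on random data says little about a real model, where attention tends to be far more concentrated; the value of the sketch is the sweep structure, which pairs naturally with the passkey and NarrativeQA accuracy checks above.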

Optimization Features

Token Efficiency: Token budget concept (load only tokens in Top-K pages); Accuracy
Infra Optimization: Reduces memory movement and GPU bandwidth pressure; Compatible with low-bit weight quantization (4-bit)
System Optimization: Dedicated CUDA kernels and FlashInfer integration; Batched Top-K via RAFT
Inference Optimization: Query-aware KV cache page selection; Per-page Key min/max upper-bound scoring; Top-K page loading for sparse self-attention

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Quest is not applied to the first two model layers because they show low sparsity.

Requires tuning of page size and Top-K token budget per model and workload.

When Not To Use

Short-context scenarios where the full KV cache fits in fast memory.

Workloads where early layers are critical and cannot be skipped by page filtering.

Failure Modes

Choosing Top-K too small can miss critical tokens and drop accuracy.

Per-page min/max metadata may give loose upper bounds. A page whose bound is overestimated is loaded unnecessarily and, under a fixed Top-K budget, can displace a page that holds a truly critical token.
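A tiny hand-built case shows how loose the bound can be. The two keys below are made up for illustration: each channel's min and max come from different keys, so the bound combines values that no single key actually holds.

```python
import numpy as np

# One page holding two keys whose channels are anti-correlated.
page = np.array([[1.0, -1.0],
                 [-1.0, 1.0]])
query = np.array([1.0, 1.0])

key_min, key_max = page.min(axis=0), page.max(axis=0)  # [-1, -1] and [1, 1]
bound = np.maximum(query * key_min, query * key_max).sum()
actual = (page @ query).max()

print(bound, actual)  # prints "2.0 0.0"
```

The bound never underestimates a page, so accuracy loss comes only from ranking: an overestimated page like this one can outrank a page with a genuinely high score when Top-K is small.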

Core Entities

Models

LongChat-7b-v1.5-32k, Yarn-Llama-2-7b-128k, Llama2-7B, Llama-7B

Metrics

self-attention latency reduction, end-to-end decode latency, accuracy, perplexity, top-token recall rate

Datasets

PG19, LongBench (NarrativeQA, HotpotQA, Qasper, TriviaQA, GovReport, MultiFieldQA), passkey retrieval (Yarn)

Benchmarks

LongBench, passkey retrieval