Quest speeds long-context LLM decoding by loading only the KV cache pages likely relevant to the current query

Overview

Decision Snapshot: Needs Validation

The paper includes a public implementation, GPU kernels, and multi-dataset evaluations; results are promising but primarily shown on a few 7B-class models and GPUs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Links

Abstract / PDF / Code

Why It Matters For Business

Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

Quest is a query-aware KV-cache selection algorithm for long-context LLM inference. It stores per-page Key min/max vectors as small metadata and, at each decode step, scores each page by an upper bound on the dot product between the current Query and any Key in that page. Quest then loads only the top-K pages for attention. At 32K context length on 7B-class models, Quest cuts self-attention memory movement and achieves up to 7.03× self-attention speedup and 2.23× end-to-end decoding speedup with negligible accuracy loss on long-context benchmarks. Code is public.
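As a rough illustration of the scoring and selection steps described above, the following NumPy sketch builds per-page Key min/max vectors, derives an upper-bound score for each page, and attends over only the top-scoring pages. It is single-head and CPU-only. The function names, the page size, and the page budget are assumptions made for this example and are not taken from the Quest codebase, which relies on dedicated CUDA kernels.

```python
import numpy as np


def build_page_metadata(keys: np.ndarray, page_size: int):
    """Return element-wise min and max of the Key vectors in each page.

    keys has shape (num_tokens, head_dim); the last page may be partly filled.
    """
    num_pages = -(-keys.shape[0] // page_size)  # ceiling division
    pages = [keys[p * page_size:(p + 1) * page_size] for p in range(num_pages)]
    mins = np.stack([page.min(axis=0) for page in pages])
    maxs = np.stack([page.max(axis=0) for page in pages])
    return mins, maxs


def page_upper_bounds(query: np.ndarray, mins: np.ndarray, maxs: np.ndarray):
    """Upper bound on query . key for any key stored in each page.

    Per channel, the largest possible product is max(q_i * min_i, q_i * max_i);
    summing over channels bounds the full dot product.
    """
    return np.maximum(query * mins, query * maxs).sum(axis=-1)


def select_pages(query, mins, maxs, page_budget: int):
    """Indices of the page_budget highest-scoring pages, in ascending order."""
    scores = page_upper_bounds(query, mins, maxs)
    k = min(page_budget, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]
    return np.sort(top)


def sparse_attention(query, keys, values, pages, page_size: int):
    """Softmax attention restricted to the tokens of the selected pages."""
    num_tokens = keys.shape[0]
    token_idx = np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, num_tokens))
        for p in pages
    ])
    logits = keys[token_idx] @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[token_idx]
```

In a decode loop under these assumptions, `build_page_metadata` would run once over the prompt, and each new Query would call `select_pages` followed by `sparse_attention`. Selecting every page reproduces dense attention exactly, which gives a convenient correctness check.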

Problem Statement

Long LLM contexts (tens of thousands of tokens) slow decode-stage inference because the full KV cache must be loaded for each token. Prior pruning methods drop tokens based on history and can miss tokens that become critical for future queries. We need a fast way to pick which KV cache parts to load per query without discarding the cache.
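To put rough numbers on the cost of loading the full cache, here is a back-of-the-envelope sketch of the KV bytes read per generated token. The defaults are assumptions for illustration: a Llama-2-7B-like shape (32 layers, hidden size 4096, FP16 storage) and a page size of 16 tokens, which the text above does not specify. The sketch also applies page selection to every layer, whereas Quest skips the first two layers.

```python
def full_kv_bytes_per_step(seq_len, num_layers=32, hidden_size=4096,
                           bytes_per_elem=2):
    """Bytes of K and V read per decode step when the whole cache is loaded."""
    return 2 * num_layers * hidden_size * bytes_per_elem * seq_len


def selected_kv_bytes_per_step(seq_len, token_budget, page_size=16,
                               num_layers=32, hidden_size=4096,
                               bytes_per_elem=2):
    """Bytes read per decode step with page selection.

    Counts one min and one max vector for every page, plus K and V for the
    tokens that fit in the budget.
    """
    num_pages = -(-seq_len // page_size)  # ceiling division
    metadata = 2 * num_pages * hidden_size * bytes_per_elem
    selected = 2 * min(token_budget, seq_len) * hidden_size * bytes_per_elem
    return num_layers * (metadata + selected)


GIB = 2 ** 30
full = full_kv_bytes_per_step(32 * 1024)
sparse = selected_kv_bytes_per_step(32 * 1024, token_budget=2048)
print(full / GIB, sparse / GIB, full / sparse)  # 16.0 2.0 8.0
```

Under these assumptions the memory traffic drops by about 8× at a 32K context with a 2048-token budget. This is an upper limit on bandwidth savings for attention, not a latency prediction.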

Main Contribution

Show that which KV tokens matter depends strongly on the current Query vector, motivating query-aware selection.

Introduce Quest: a page-level criticality estimator that uses per-page Key min/max metadata and the current Query to score pages cheaply.

Key Findings

Quest achieves large self-attention speedups by loading only top-K pages instead of the full KV cache.

Numbers: 7.03× self-attention speedup at 32K sequence length, token budget 2048

Practical Use: For long-context decoding you can reduce attention time by several× by switching to query-aware page selection and loading only Top-K pages.

Evidence Ref: Fig. 9 and Sec. 4.3.1

Quest reduces end-to-end decode latency when combined with weight quantization.

Numbers: 2.23× end-to-end inference speedup with 4-bit weights at 32K sequence length, token budget 2048

Practical Use: Combine Quest with low-bit weight quantization for a practical end-to-end latency win in production.

Evidence Ref: Fig. 10 and Sec. 4.3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Self-attention latency reduction	7.03×	FlashInfer	7.03× faster	32K sequence length, token budget 2048	Measured kernel-level reduction; Fig.9	Sec.4.3.1 / Fig.9
End-to-end decode latency	2.23× speedup	FlashInfer (full KV cache) with FP16	2.23× faster	32K sequence length, token budget 2048, 4-bit weights	Measured single-batch text generation latency; Fig.10	Sec.4.3.2 / Fig.10

What To Try In 7 Days

Clone Quest repo and reproduce kernel tests on a small long-context model and sample inputs.

Run passkey retrieval or NarrativeQA with and without Quest to measure accuracy vs token budget.

Tune page size and Top-K token budget to find the accuracy/speed sweet spot for your workload.
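One possible way to approach the tuning step is to sweep page size and token budget while tracking top-token recall: the fraction of the highest-scoring tokens that land inside the selected pages. The sketch below is a hypothetical harness, not part of the Quest repository. Its function names and the `top_n` parameter are assumptions, and the Query and Key arrays should come from your own model rather than from random data.

```python
import numpy as np


def page_scores(query, keys, page_size):
    """Upper-bound score per page from element-wise Key min/max."""
    num_pages = -(-len(keys) // page_size)  # ceiling division
    scores = np.empty(num_pages)
    for p in range(num_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        bound = np.maximum(query * page.min(axis=0), query * page.max(axis=0))
        scores[p] = bound.sum()
    return scores


def top_token_recall(query, keys, page_size, token_budget, top_n=32):
    """Fraction of the top_n highest-scoring tokens covered by the kept pages."""
    pages_kept = max(1, token_budget // page_size)
    kept_pages = np.argsort(-page_scores(query, keys, page_size))[:pages_kept]
    kept = np.zeros(len(keys), dtype=bool)
    for p in kept_pages:
        kept[p * page_size:(p + 1) * page_size] = True
    top_tokens = np.argsort(-(keys @ query))[:top_n]
    return kept[top_tokens].mean()


def sweep(queries, keys, page_sizes, token_budgets):
    """Mean recall over queries for every (page_size, token_budget) pair."""
    return {
        (ps, b): float(np.mean([top_token_recall(q, keys, ps, b)
                                for q in queries]))
        for ps in page_sizes
        for b in token_budgets
    }
```

Recall is only a proxy, so a configuration that looks good here would still need to be confirmed against task accuracy and measured latency.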

Optimization Features

Token Efficiency

Token budget concept (load only tokens in Top-K pages)

Accuracy

Infra Optimization

Reduces memory movement and GPU bandwidth pressure

Compatible with low-bit weight quantization (4-bit)

System Optimization

Dedicated CUDA kernels and FlashInfer integration

Batched Top-K via RAFT

Inference Optimization

Query-aware KV cache page selection

Per-page Key min/max upper-bound scoring

Top-K page loading for sparse self-attention

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/mit-han-lab/Quest

Risks & Boundaries

Limitations

Quest is not applied to the first two model layers because they show low sparsity.

Requires tuning of page size and Top-K token budget per model and workload.

When Not To Use

Short-context scenarios where the full KV cache fits in fast memory.

Workloads where early layers are critical and cannot be skipped by page filtering.

Failure Modes

Choosing Top-K too small can miss critical tokens and drop accuracy.

Per-page min/max metadata may give loose upper bounds and lead to unnecessary page loads or missed tokens.
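A tiny constructed example may help show how a min/max bound can be loose. The two Keys below are made up for illustration: each has a zero dot product with the Query, yet the per-page bound is 2, because the min/max vectors describe a bounding box that also contains the hypothetical Key [1, 1].

```python
import numpy as np

# Two Keys stored in one page (illustrative values, not from the paper).
page_keys = np.array([[1.0, -1.0],
                      [-1.0, 1.0]])
query = np.array([1.0, 1.0])

true_scores = page_keys @ query      # array([0., 0.])
page_min = page_keys.min(axis=0)     # array([-1., -1.])
page_max = page_keys.max(axis=0)     # array([1., 1.])

# Per channel, take the larger of q_i * min_i and q_i * max_i, then sum.
upper_bound = np.maximum(query * page_min, query * page_max).sum()
print(true_scores.max(), upper_bound)  # 0.0 2.0
```

With a fixed page budget, a page over-scored like this one could take a slot that a page holding a genuinely important token needed.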

Core Entities

Models

LongChat-7b-v1.5-32k, Yarn-Llama-2-7b-128k, Llama2-7B, Llama-7B

Metrics

self-attention latency reduction, end-to-end decode latency, accuracy, perplexity, top-token recall rate

Datasets

PG19, LongBench (NarrativeQA, HotpotQA, Qasper, TriviaQA, GovReport, MultifieldQA), passkey retrieval (Yarn)

Benchmarks

LongBench, passkey retrieval

