Quest speeds long-context LLM decoding by loading only the KV cache pages likely relevant to the current query

June 16, 20248 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

1

Authors

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Links

Abstract / PDF

Why It Matters For Business

Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.

Summary TLDR

Quest is a query-aware KV-cache selection algorithm for long-context LLM inference. It stores per-page Key min/max vectors as tiny metadata and, at each decode step, scores pages by how large their worst-case dot-products with the current Query can be. Quest then loads only the top-K pages for attention. On 32K contexts and typical models, Quest cuts self-attention memory movement and achieves up to 7.03× self-attention speedup and 2.23× end-to-end decoding speedup with negligible accuracy loss on long-context benchmarks. Code is public.

Problem Statement

Long LLM contexts (tens of thousands of tokens) slow decode-stage inference because the full KV cache must be loaded for each token. Prior pruning methods drop tokens based on history and can miss tokens that become critical for future queries. We need a fast way to pick which KV cache parts to load per query without discarding the cache.

Main Contribution

Show that which KV tokens matter depends strongly on the current Query vector, motivating query-aware selection.

Introduce Quest: a page-level criticality estimator that uses per-page Key min/max metadata and the current Query to score pages cheaply.

Provide CUDA kernels and FlashInfer integration and show up to 7.03× self-attention and 2.23× end-to-end decode speedups with near-lossless accuracy on long-context benchmarks.

Key Findings

Quest achieves large self-attention speedups by loading only top-K pages instead of the full KV cache.

Numbers7.03× self-attention speedup at 32K seq, token budget 2048

Quest reduces end-to-end decode latency when combined with weight quantization.

Numbers2.23× end-to-end inference speedup with 4-bit weights at 32K seq, budget 2048

Quest preserves long-dependency accuracy with very small token budgets on retrieval-style tasks.

Numbers99–100% passkey retrieval accuracy with 64–1024 token budgets (1%–1.6% of 10k–100k length)

Per-layer sparsity is high except in early layers.

NumbersFirst two layers <10% sparsity; later layers >90% sparsity on LongChat-7B

Quest's page selection approximates oracle top-token choices closely, maintaining high recall of top attention tokens.

NumbersRecall rates close to full attention across decoding rounds (see Fig.4)

Memory movement can be reduced substantially; example parameterization gives an 8× reduction.

NumbersExample: 16 KV pairs/page, 64K context, top 4K pages → 8× memory load reduction

Results

Self-attention latency reduction

Value7.03×

BaselineFlashInfer

End-to-end decode latency

Value2.23× speedup

BaselineFlashInfer (full KV cache) with FP16

Accuracy

Value99–100%

BaselineFull KV cache (oracle)

Perplexity (language modeling)

ValueClosely matches full-cache model

BaselineFull KV cache oracle

Who Should Care

What To Try In 7 Days

Clone Quest repo and reproduce kernel tests on a small long-context model and sample inputs.

Run passkey retrieval or NarrativeQA with and without Quest to measure accuracy vs token budget.

Tune page size and Top-K token budget to find the accuracy/speed sweet spot for your workload.

Optimization Features

Token Efficiency

  • Token budget concept (load only tokens in Top-K pages)
  • Accuracy

Infra Optimization

  • Reduces memory movement and GPU bandwidth pressure
  • Compatible with low-bit weight quantization (4-bit)

System Optimization

  • Dedicated CUDA kernels and FlashInfer integration
  • Batched Top-K via RAFT

Inference Optimization

  • Query-aware KV cache page selection
  • Per-page Key min/max upper-bound scoring
  • Top-K page loading for sparse self-attention

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Quest is not applied to the first two model layers because they show low sparsity.
  • Requires tuning of page size and Top-K token budget per model and workload.
  • Current implementation targets CUDA/FlashInfer; portability to other runtimes is untested in paper.
  • The per-page min/max heuristic may misestimate pages when Key distributions are complex.

When Not To Use

  • Short-context scenarios where the full KV cache fits in fast memory.
  • Workloads where early layers are critical and cannot be skipped by page filtering.
  • Environments without GPU/CUDA support or where FlashInfer cannot be used.

Failure Modes

  • Choosing Top-K too small can miss critical tokens and drop accuracy.
  • Per-page min/max metadata may give loose upper bounds and lead to unnecessary page loads or missed tokens.
  • Criticality estimation overhead can dominate at small sequence lengths, reducing benefits.

Core Entities

Models

  • LongChat-7b-v1.5-32k
  • Yarn-Llama-2-7b-128k
  • Llama2-7B
  • Llama-7B

Metrics

  • self-attention latency reduction
  • end-to-end decode latency
  • Accuracy
  • perplexity
  • top-token recall rate

Datasets

  • PG19
  • LongBench (NarrativeQA, HotpotQA, Qasper, TriviaQA, GovReport, MultifieldQA)
  • passkey retrieval (Yarn)

Benchmarks

  • LongBench
  • passkey retrieval