Reduce long-context LLM latency and keep accuracy past model input limits by reusing and re-positioning cached KV-contexts

February 16, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na

Links

Abstract / PDF

Why It Matters For Business

CacheFocus lowers inference latency and keeps answer quality when LLMs must use many retrieved documents, saving compute cost on long-context production queries without retraining.

Summary TLDR

CacheFocus is a no-retraining method for retrieval-augmented generation that pre-computes document KV caches offline, shifts cached keys in positional space (re-positioning), prunes low-relevance caches per layer, and reassigns freed positions. On Natural Questions and TriviaQA with LLaMA-2 and Qwen2 models, it lowers prefill and total inference time and maintains accuracy when inputs exceed the model's maximum length. The approach trades offline storage for faster, more stable decoding on long-context workloads.

Problem Statement

LLMs struggle with very long retrieved contexts: naive concatenation makes pre-filling slow, and accuracy drops once the input exceeds the model's maximum length. Existing fixes need extra training or heuristics and can still degrade when many documents are added. The paper asks: can offline caching plus smart cache positioning and pruning extend effective context and cut inference cost without retraining?

Main Contribution

Query-independent offline caching: precompute per-document KV caches and store a shared prefix to avoid recomputing document encodings at query time.

Cache Re-Positioning: convert cached keys back to an unrotated form and re-apply RoPE rotations to place caches at new positional indices, avoiding attention amplification.

Layer-Adaptive Cache Pruning: accumulate document-level attention scores across layers and prune low-relevance caches during pre-filling to reduce noise.

Adaptive Positional Allocation: after pruning, reassign freed positional slots either by simple alignment ('align') or by attention-guided reordering ('sort') to maximize encoding-space use.

Empirical evaluation: shows lower prefill and total latency and stable accuracy across enlarged contexts on NQ and TQA with LLaMA-2 and Qwen2 models.
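The cache re-positioning step listed above can be illustrated with a small NumPy sketch. It assumes the interleaved-pair RoPE convention and a base of 10000; some model implementations pair dimensions differently, and the names `rope_angles`, `rotate`, and `reposition_keys` are illustrative rather than taken from the paper.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angles of shape (len(positions), head_dim // 2)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(np.asarray(positions, dtype=np.float64), inv_freq)

def rotate(x, angles):
    """Rotate each pair (x[2i], x[2i+1]) by the matching angle."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def reposition_keys(cached_keys, old_positions, new_positions, base=10000.0):
    """Undo the rotation applied at caching time, then rotate to new positions."""
    head_dim = cached_keys.shape[-1]
    unrotated = rotate(cached_keys, -rope_angles(old_positions, head_dim, base))
    return rotate(unrotated, rope_angles(new_positions, head_dim, base))

# Toy check: keys cached at positions 0..3 are moved to positions 100..103.
rng = np.random.default_rng(0)
raw_keys = rng.standard_normal((4, 8))          # (tokens, head_dim), before RoPE
old_pos, new_pos = np.arange(0, 4), np.arange(100, 104)
cached = rotate(raw_keys, rope_angles(old_pos, 8))
moved = reposition_keys(cached, old_pos, new_pos)
direct = rotate(raw_keys, rope_angles(new_pos, 8))
print(np.allclose(moved, direct))
```

Because rotations compose, un-rotating by the old position and re-rotating by the new one equals a single rotation by the position difference; the two-step form is kept here only to mirror the description.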

Key Findings

CacheFocus cuts total 100-token generation time on long inputs compared to a naive baseline.

Numbers: 4K input (20 docs), total 4.611s -> 3.162s (−31.4%)

CacheFocus preserves model accuracy when inputs exceed the model's official max length, unlike naive RAG which degrades.

Layer-adaptive pruning prevents accuracy decline when many retrieved documents become noisy.

Numbers: performance drops for Qwen2 with more than 20 docs; pruning stabilizes accuracy (Fig. 4)

Adaptive positional allocation improves retrieval re-ranking for a weak lexical retriever.

Numbers: BM25 R@5 0.2500 -> up to 0.2800 (absolute +0.03)

Results

100-token generation time (total)

Value: 4K input (20 docs): naive 4.611s; CacheFocus with pruning 3.162s

Baseline: naive 4.611s

BM25 retrieval R@5

Value: 0.2500 -> 0.2800

Baseline: BM25 0.2500

Retriever quality (DPR) R@5

Value: 0.7075 baseline; pruning variations ~0.69–0.70

Baseline: DPR 0.7075
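For readers who want to check how the percentage and absolute changes in these results relate, here is a minimal arithmetic sketch. The helper name `relative_change` is hypothetical; the inputs are the reported figures.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before

# 100-token generation time on the 4K input (20 docs), in seconds.
latency_change = relative_change(4.611, 3.162)

# BM25 R@5 before and after adaptive positional allocation.
recall_abs_gain = 0.2800 - 0.2500
recall_rel_gain = relative_change(0.2500, 0.2800)

print(f"latency: {latency_change:+.1%}")
print(f"BM25 R@5: {recall_abs_gain:+.2f} absolute, {recall_rel_gain:+.0%} relative")
```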

Who Should Care

What To Try In 7 Days

Precompute and store document KV caches for stable documents and measure prefill time savings.

Add a simple per-layer attention aggregator and prune the lowest-scoring document caches every few layers (the paper uses n=4).

Implement RoPE-based re-positioning of cached keys to reuse positional slots and avoid attention amplification; compare latency and accuracy against your current RAG pipeline.
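The pruning step suggested above can be prototyped offline before touching a model. The sketch below is a toy simulation over a precomputed `(num_layers, num_docs)` matrix of document-level attention mass; in a real pipeline those scores would come from attention weights during prefill. `prune_every=4` mirrors the n=4 setting, while `drop_per_step` and `min_docs` are assumptions made for illustration.

```python
import numpy as np

def layer_adaptive_prune(doc_attention, prune_every=4, drop_per_step=1, min_docs=1):
    """Accumulate per-document attention across layers and drop the weakest docs.

    doc_attention: array of shape (num_layers, num_docs).
    Returns (indices of surviving docs, accumulated scores).
    """
    num_layers, num_docs = doc_attention.shape
    active = np.ones(num_docs, dtype=bool)
    accumulated = np.zeros(num_docs)
    for layer in range(num_layers):
        # Pruned documents no longer contribute attention in later layers.
        accumulated += np.where(active, doc_attention[layer], 0.0)
        if (layer + 1) % prune_every == 0:
            n_drop = min(drop_per_step, int(active.sum()) - min_docs)
            if n_drop > 0:
                candidates = np.flatnonzero(active)
                worst_first = candidates[np.argsort(accumulated[candidates], kind="stable")]
                active[worst_first[:n_drop]] = False
    return np.flatnonzero(active), accumulated

# Toy input: 8 layers, 4 documents, the same attention profile at every layer.
attention = np.tile(np.array([0.1, 0.4, 0.2, 0.3]), (8, 1))
kept, scores = layer_adaptive_prune(attention)
print(kept.tolist())
```

In this toy run the weakest document is dropped after layer 4 and the next weakest after layer 8. Raising `min_docs` is one simple guard against over-pruning.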

Optimization Features

Token Efficiency

  • Maximize positional encoding utilization to fit more windows
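One way to picture how freed positions get reused is the sketch below. It packs the surviving documents into contiguous position ranges after a shared prefix, either in their original order ('align') or ordered by attention score ('sort'). Placing the highest-scoring document closest to the query is an assumption made here, as are the function name and its arguments; the paper's exact allocation rule may differ.

```python
def allocate_positions(doc_lengths, scores, kept, prefix_len=0, strategy="align"):
    """Assign each surviving document a half-open position range [start, end)."""
    if strategy == "align":
        order = sorted(kept)                           # keep original document order
    elif strategy == "sort":
        order = sorted(kept, key=lambda d: scores[d])  # best document last (assumed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    slots, start = {}, prefix_len
    for doc in order:
        slots[doc] = (start, start + doc_lengths[doc])
        start += doc_lengths[doc]
    return slots

# Toy input: four documents, of which documents 1 and 3 survived pruning.
doc_lengths = [50, 60, 40, 30]
scores = [0.4, 3.2, 1.6, 2.4]
kept = [1, 3]

aligned = allocate_positions(doc_lengths, scores, kept, prefix_len=10, strategy="align")
by_score = allocate_positions(doc_lengths, scores, kept, prefix_len=10, strategy="sort")
print(aligned)
print(by_score)
```

Either way the surviving documents occupy 90 positions after the prefix instead of the 180 that all four documents would need, which is the encoding-space saving the bullet above refers to.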

System Optimization

  • Lower prefill complexity via O(L) cache loading
  • Keep decode work proportional to final cache size

Inference Optimization

  • Query-independent offline caching to avoid repeated encoding
  • Layer-adaptive pruning to reduce attended contexts during prefill
  • Cache re-positioning (RoPE inversion/re-apply) to reuse positional slots
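The first item above, query-independent offline caching, can be sketched as a small store keyed by a content hash. Everything here is hypothetical scaffolding: `KVCacheStore` is an illustrative class, and `fake_encode` stands in for a real model forward pass that would return per-layer keys and values.

```python
import hashlib
import numpy as np

class KVCacheStore:
    """In-memory store of per-document (keys, values), computed once offline."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.encode_calls = 0

    @staticmethod
    def key_for(text):
        # Hashing the content means an edited document gets a new cache key.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def precompute(self, docs):
        for text in docs:
            key = self.key_for(text)
            if key not in self._store:
                self._store[key] = self._encode_fn(text)
                self.encode_calls += 1

    def load(self, text):
        return self._store[self.key_for(text)]

def fake_encode(text, head_dim=8):
    """Hypothetical stand-in for a model forward pass; returns (keys, values)."""
    n_tokens = len(text.split())
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_tokens, head_dim)), rng.standard_normal((n_tokens, head_dim))

docs = ["paris is the capital of france", "the nile is a river in africa"]
store = KVCacheStore(fake_encode)
store.precompute(docs)   # offline pass
store.precompute(docs)   # repeated pass reuses the stored entries
keys, values = store.load(docs[0])
print(store.encode_calls, keys.shape)
```

A production version would persist the arrays to disk or object storage and track which model produced them, since caches are only valid for the model that computed them.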

Reproducibility

Data Urls

  • Natural Questions (public dataset)
  • TriviaQA (public dataset)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires offline precomputation and storage of per-document caches; costly if documents update frequently.
  • Evaluated only in zero-shot QA settings and on a limited set of LLMs (LLaMA-2, Qwen2).
  • Relies on document segmentation into short passages; not ideal for single very long documents that cannot be split.
  • Pruning decisions could remove useful context if attention scores mis-rank relevance.

When Not To Use

  • Your document collection changes rapidly and offline caches would become stale.
  • You need few-shot in-context examples inside the same windows (paper used zero-shot).
  • Data cannot be reasonably split into short passages for caching.

Failure Modes

  • Stale offline caches when source documents change frequently.
  • Over-pruning removes relevant documents and reduces answer quality.
  • Excessive reuse of positional slots could still amplify attention and degrade accuracy if re-positioning is misapplied.
  • Memory overhead from storing large per-document KV caches.

Core Entities

Models

  • LLaMA-2-7B-Chat
  • Qwen2-1.5B-Instruct
  • Qwen2-7B-Instruct

Metrics

  • Accuracy
  • Recall@K
  • MRR@100
  • 100-token generation time (prefill/decode/total)
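The retrieval metrics above can be computed as in the minimal sketch below. It assumes Recall@K means the fraction of relevant documents found in the top K and MRR@K uses the rank of the first relevant document within the top K; the paper's exact definitions may differ (QA work often uses a hit-or-miss variant of recall), and the function names are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked_ids, relevant_ids, k=100):
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Toy input: (ranked retrieval output, set of relevant document ids) per query.
queries = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4"}),
]
mean_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in queries) / len(queries)
mean_mrr = sum(mrr_at_k(r, rel) for r, rel in queries) / len(queries)
print(mean_recall_at_2, round(mean_mrr, 4))
```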

Datasets

  • Natural Questions
  • TriviaQA