Reduce long-context LLM latency and keep accuracy past model input limits by reusing and re-positioning cached KV-contexts

February 16, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method requires offline cache storage and per-layer scoring but shows concrete latency gains and stable accuracy on two public QA datasets; results are promising but limited to the reported LLaMA-2 and Qwen2 experiments.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na

Why It Matters For Business

CacheFocus lowers inference latency and keeps answer quality when LLMs must use many retrieved documents, saving compute cost on long-context production queries without retraining.

Summary TLDR

CacheFocus is a no-retraining method for retrieval-augmented generation that precomputes document KV caches offline, shifts cached keys in positional space (re-positioning), prunes low-relevance caches per layer, and reassigns the freed positions. On Natural Questions and TriviaQA with LLaMA-2 and Qwen2 models, it lowers prefill and total inference time and keeps accuracy stable when inputs exceed the model's maximum length. The approach trades offline storage for faster, more stable decoding on long-context workloads.

Problem Statement

LLMs struggle with very long retrieved contexts: model input-length limits and naive concatenation cause slow pre-filling and accuracy drops beyond the model's max tokens. Existing fixes need extra training or heuristics and can still degrade when many documents are added. The paper asks: can offline caching plus smart cache positioning and pruning extend effective context and cut inference cost without retraining?

Main Contribution

Query-independent offline caching: precompute per-document KV caches and store a shared prefix to avoid recomputing document encodings at query time.

Cache Re-Positioning: convert cached keys back to an unrotated form and re-apply RoPE rotations to place caches at new positional indices, avoiding attention amplification.
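
The re-positioning step above can be illustrated with a minimal NumPy sketch. It assumes the half-split RoPE pairing (element i paired with element i + dim/2) and a base of 10000; the function names `rope_rotate` and `reposition_keys` are hypothetical and are not taken from the paper, whose implementation may differ in layout and details.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply RoPE to x of shape (seq, dim) at the given positions.

    Assumes the half-split pairing: x[:, i] is rotated together with
    x[:, i + dim // 2]. Passing negated positions rotates the other way,
    which undoes an earlier rotation.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)  # theta_i = base^(-2i/dim)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def reposition_keys(cached_keys, old_positions, new_positions):
    """Move cached keys from old_positions to new_positions.

    Step 1 un-rotates the keys (inverse RoPE); step 2 re-applies RoPE
    at the new positional indices.
    """
    unrotated = rope_rotate(cached_keys, -np.asarray(old_positions))
    return rope_rotate(unrotated, new_positions)

# Illustrative check on random data (shapes and positions are made up).
rng = np.random.default_rng(0)
raw_keys = rng.standard_normal((4, 8))
old_pos = np.arange(100, 104)
new_pos = np.arange(10, 14)
cached = rope_rotate(raw_keys, old_pos)          # what an offline cache would hold
moved = reposition_keys(cached, old_pos, new_pos)
print(np.allclose(moved, rope_rotate(raw_keys, new_pos)))
```

Because rotations compose, the two steps are mathematically equivalent to a single rotation by the position difference; the sketch keeps them separate to mirror the description above.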

Key Findings

CacheFocus cuts total 100-token generation time on long inputs compared to a naive baseline.

Numbers: 4K input (20 docs): total 4.611s -> 3.162s (−31.4%)

Practical Use: Use offline KV caches plus pruning to lower end-to-end latency and inference cost for long-context queries.

Evidence Ref: Table 2

CacheFocus preserves model accuracy when inputs exceed the model's official max length, unlike naive RAG which degrades.

Practical Use: For tasks that need more context than the model supports, repositioned caches let you use extra documents without retraining the model.

Evidence Ref: Figure 1 (LLaMA-2-7B-Chat) and text

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
100-token generation time (total) | 4K input (20 docs): naive 4.611s; CacheFocus w/ prune 3.162s | naive 4.611s | −1.449s (−31.4%) | LLaMA-2-7B-Chat, Table 2 | Table 2 reports prefill/decode/total times for naive and cached/pruned settings | Table 2
BM25 retrieval R@5 | 0.2500 -> 0.2800 | BM25 0.2500 | +0.0300 (absolute) | NQ, Qwen2-1.5B-Instruct, Table 1 | Table 1 shows R@K under baseline and pruning/positional strategies | Table 1
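
The latency delta reported in the Results can be re-derived from the two totals. The short check below uses only the numbers quoted above; the helper name `delta` and the speedup ratio are illustrative additions, not figures from the paper.

```python
def delta(baseline, value):
    """Return (absolute change, relative change) of value against baseline."""
    absolute = value - baseline
    return absolute, absolute / baseline

naive_total_s = 4.611       # naive baseline, 4K input (20 docs)
cachefocus_total_s = 3.162  # CacheFocus with pruning, same setting

abs_change, rel_change = delta(naive_total_s, cachefocus_total_s)
speedup = naive_total_s / cachefocus_total_s  # derived here, not reported in the source

print(f"{abs_change:+.3f} s, {rel_change:+.1%}, {speedup:.2f}x")
```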

What To Try In 7 Days

Precompute and store document KV caches for stable documents and measure prefill time savings.

Add a simple per-layer attention aggregator and prune lowest-scoring document caches every few layers (n=4 used here).

Implement RoPE-based re-positioning of cached keys to reuse positional slots and avoid attention amplification; compare latency and accuracy against your current RAG pipeline.
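
The pruning step in the list above can be prototyped before touching a real model. The sketch below is a simplified stand-in, not the paper's scoring rule: it assumes you can obtain, per layer, an attention matrix from query tokens to cached tokens, sums the attention mass per document, and drops the lowest-scoring document every `prune_every` layers. The function name, the summed-mass score, and the `drop_per_step` and `min_docs` parameters are all hypothetical.

```python
import numpy as np

def prune_documents(attn_per_layer, doc_ids, prune_every=4, drop_per_step=1, min_docs=1):
    """Return the document ids that survive layer-wise pruning.

    attn_per_layer: iterable of arrays shaped (query_tokens, cached_tokens).
    doc_ids: for each cached token, the id of the document it belongs to.
    In a real model, later layers would attend only to surviving documents;
    here the matrices are given up front and pruned documents are ignored.
    """
    doc_ids = np.asarray(doc_ids)
    active = sorted(set(doc_ids.tolist()))
    running = {d: 0.0 for d in active}  # accumulated attention mass per document
    for layer_idx, attn in enumerate(attn_per_layer, start=1):
        token_mass = np.asarray(attn).sum(axis=0)  # mass received by each cached token
        for d in active:
            running[d] += float(token_mass[doc_ids == d].sum())
        if layer_idx % prune_every == 0:
            n_drop = min(drop_per_step, len(active) - min_docs)
            for d in sorted(active, key=running.get)[:max(n_drop, 0)]:
                active.remove(d)  # lowest accumulated score goes first
    return active

# Toy example: 3 documents, 2 tokens each, 2 query tokens, 8 identical layers.
doc_ids = [0, 0, 1, 1, 2, 2]
layer_attn = np.array([[0.30, 0.30, 0.05, 0.05, 0.15, 0.15]] * 2)
print(prune_documents([layer_attn] * 8, doc_ids))
```

Keeping `min_docs` above zero is one simple guard against over-pruning; measuring answer quality at several pruning rates is still needed to pick safe settings.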

Optimization Features

Token Efficiency
Maximize positional encoding utilization to fit more windows

System Optimization
Lower prefill complexity via O(L) cache loading
Keep decode work proportional to final cache size

Inference Optimization
Query-independent offline caching to avoid repeated encoding
Layer-adaptive pruning to reduce attended contexts during prefill
Cache re-positioning (RoPE inversion/re-apply) to reuse positional slots

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

Natural Questions (public dataset)
TriviaQA (public dataset)

Risks & Boundaries

Limitations

Requires offline precomputation and storage of per-document caches; costly if documents update frequently.

Evaluated only in zero-shot QA settings and on a limited set of LLMs (LLaMA-2, Qwen2).
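
To make the storage limitation concrete, the back-of-the-envelope estimate below computes the size of an uncompressed KV cache. The configuration values (32 layers, 32 KV heads, head dimension 128, 2 bytes per value for fp16) are assumptions matching the commonly published LLaMA-2-7B configuration, and the 200-token passage length is purely illustrative; check your own checkpoint and corpus before relying on these numbers.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for num_tokens across all layers."""
    # Leading factor 2: one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Assumed LLaMA-2-7B-style configuration in fp16.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
per_passage = kv_cache_bytes(200, num_layers=32, num_kv_heads=32, head_dim=128)

print(f"per token:   {per_token / 2**20:.2f} MiB")
print(f"per passage: {per_passage / 2**20:.0f} MiB (200 tokens, illustrative)")
```

Under these assumptions each cached token costs about half a mebibyte, so the offline store grows quickly with corpus size; models with fewer KV heads or quantized caches would need proportionally less.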

When Not To Use

Your document collection changes rapidly and offline caches would become stale.

You need few-shot in-context examples inside the same windows (paper used zero-shot).

Failure Modes

Stale offline caches when source documents change frequently.

Over-pruning removes relevant documents and reduces answer quality.
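
One way to guard against the stale-cache failure mode is to key each cached entry by a hash of the document text and treat a hash mismatch as a miss. The sketch below is an in-memory illustration under that assumption; the class `KVCacheStore` and its methods are hypothetical, and the payload stands in for whatever serialized KV tensors a real system would store.

```python
import hashlib

class KVCacheStore:
    """In-memory store that invalidates entries when document text changes."""

    def __init__(self):
        self._entries = {}  # doc_id -> (content_hash, kv_payload)

    @staticmethod
    def _digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def put(self, doc_id, text, kv_payload):
        """Store the payload together with a hash of the text it was built from."""
        self._entries[doc_id] = (self._digest(text), kv_payload)

    def get(self, doc_id, current_text):
        """Return the payload, or None if it is missing or stale."""
        entry = self._entries.get(doc_id)
        if entry is None:
            return None
        stored_hash, payload = entry
        if stored_hash != self._digest(current_text):
            del self._entries[doc_id]  # stale: caller should recompute offline
            return None
        return payload

store = KVCacheStore()
store.put("doc-1", "original text", kv_payload=b"kv-bytes-placeholder")
print(store.get("doc-1", "original text") is not None)  # unchanged document: hit
print(store.get("doc-1", "edited text") is None)        # changed document: miss
```

A production version would likely also fold the model identifier and the shared prefix into the hash, since a cache computed under one model or prefix is not valid for another.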

Core Entities

Models

LLaMA-2-7B-Chat
Qwen2-1.5B-Instruct
Qwen2-7B-Instruct

Metrics

Accuracy
Recall@K
MRR@100
100-token generation time (prefill/decode/total)

Datasets

Natural Questions
TriviaQA