Reduce long-context LLM latency and keep accuracy past model input limits by reusing and re-positioning cached KV-contexts

Overview

Decision Snapshot: Needs Validation

The method requires offline cache storage and per-layer scoring but shows concrete latency gains and stable accuracy on two public QA datasets; results are promising but limited to the reported LLaMA-2 and Qwen2 experiments.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Kun-Hui Lee, Eunhwan Park, Donghoon Han, Seung-Hoon Na

Links

Abstract / PDF / Data

Why It Matters For Business

CacheFocus lowers inference latency and keeps answer quality when LLMs must use many retrieved documents, saving compute cost on long-context production queries without retraining.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Product Manager

Summary TLDR

CacheFocus is a no-retraining method for retrieval-augmented generation that pre-computes document KV caches offline, shifts cached keys in positional space (re-positioning), prunes low-relevance caches per layer, and reassigns freed positions. On Natural Questions and TriviaQA with LLaMA-2 and Qwen2 models it lowers prefill and total inference time and keeps robust accuracy when inputs exceed model max length. The approach trades offline storage for faster, more stable decoding on long-context workloads.

Problem Statement

LLMs struggle with very long retrieved contexts: model input-length limits and naive concatenation cause slow pre-filling and accuracy drops beyond the model's max tokens. Existing fixes need extra training or heuristics and can still degrade when many documents are added. The paper asks: can offline caching plus smart cache positioning and pruning extend effective context and cut inference cost without retraining?

Main Contribution

Query-independent offline caching: precompute per-document KV caches and store a shared prefix to avoid recomputing document encodings at query time.

Cache Re-Positioning: convert cached keys back to an unrotated form and re-apply RoPE rotations to place caches at new positional indices, avoiding attention amplification.
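
The re-positioning step above can be illustrated with a minimal pure-Python sketch. It assumes the common RoPE formulation in which consecutive (even, odd) dimension pairs are rotated by an angle of position × base^(−i/d); real implementations may pair dimensions differently, and the function names `rope_rotate` and `reposition_key` are hypothetical, not taken from the paper.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # angle for this dimension pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def reposition_key(cached_key, old_pos, new_pos, base=10000.0):
    """Undo the rotation applied at old_pos, then apply the one for new_pos."""
    unrotated = rope_rotate(cached_key, -old_pos, base)  # inverse rotation
    return rope_rotate(unrotated, new_pos, base)

# A key cached at position 37 and moved to position 5 should match
# a key that was rotated for position 5 directly.
raw = [0.5, -1.0, 2.0, 0.25]
cached = rope_rotate(raw, 37)
moved = reposition_key(cached, old_pos=37, new_pos=5)
direct = rope_rotate(raw, 5)
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(moved, direct))
```

Because rotations compose, the two steps are equivalent to a single rotation by (new_pos − old_pos); the sketch keeps them separate only to mirror the unrotate-then-reapply description.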

Key Findings

CacheFocus cuts total 100-token generation time on long inputs compared to a naive baseline.

Numbers: 4K input (20 docs): total 4.611s -> 3.162s (−31.4%)

Practical Use: Use offline KV caches plus pruning to lower end-to-end latency and inference cost for long-context queries.

Evidence Ref: Table 2

CacheFocus preserves model accuracy when inputs exceed the model's official max length, unlike naive RAG which degrades.

Practical Use: For tasks that need more context than the model supports, repositioned caches let you use extra documents without retraining the model.

Evidence Ref: Figure 1 (LLaMA-2-7B-Chat) and text

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
100-token generation time (total)	4K input (20 docs): naive 4.611s; CacheFocus w/ prune 3.162s	naive 4.611s	−1.449s (−31.4%)	LLaMA-2-7B-Chat, Table 2	Table 2 reports prefill/decode/total times for naive and cached/pruned settings	Table 2
BM25 retrieval R@5	0.2500 -> 0.2800	BM25 0.2500	+0.0300 (absolute)	NQ, Qwen2-1.5B-Instruct, Table 1	Table 1 shows R@K under baseline and pruning/positional strategies	Table 1
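
The deltas in the table can be re-derived from the reported values. The short check below is only a sketch of that arithmetic; `delta` is a hypothetical helper name, and the inputs are the figures quoted in the rows above.

```python
def delta(baseline, value):
    """Return (absolute change, relative change in percent) versus the baseline."""
    absolute = value - baseline
    return absolute, 100.0 * absolute / baseline

# Total 100-token generation time, 4K input (20 docs), from Table 2.
abs_s, rel_pct = delta(4.611, 3.162)
print(f"{abs_s:+.3f} s ({rel_pct:+.1f}%)")  # -1.449 s (-31.4%)

# R@5 on NQ, from Table 1 (reported as an absolute change).
abs_r, _ = delta(0.2500, 0.2800)
print(f"{abs_r:+.4f}")  # +0.0300
```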

What To Try In 7 Days

Precompute and store document KV caches for stable documents and measure prefill time savings.

Add a simple per-layer attention aggregator and prune lowest-scoring document caches every few layers (n=4 used here).

Implement RoPE-based re-positioning of cached keys to reuse positional slots and avoid attention amplification; compare latency and accuracy against your current RAG pipeline.
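
As a starting point for the pruning experiment in the list above, here is a minimal sketch of layer-wise document pruning. It rests on several assumptions that are not confirmed by this summary: each document's score is the running sum of the attention mass the query placed on it, one document is dropped per pruning step, and the per-layer scores are supplied up front (in a real model they would be computed layer by layer over the surviving caches). The function name `prune_documents` and its parameters are hypothetical.

```python
def prune_documents(layer_doc_scores, prune_every=4, drop_per_step=1, min_keep=1):
    """Return the set of document ids that survive layer-wise pruning.

    layer_doc_scores: one dict per layer, mapping doc_id -> attention mass
    placed on that document's cached tokens at that layer.
    """
    active = set(layer_doc_scores[0])
    running = {doc: 0.0 for doc in active}
    for layer_idx, scores in enumerate(layer_doc_scores, start=1):
        # Aggregate attention only over documents that are still cached.
        for doc in active:
            running[doc] += scores.get(doc, 0.0)
        if layer_idx % prune_every == 0:
            n_drop = min(drop_per_step, len(active) - min_keep)
            if n_drop > 0:
                # Lowest running score first; doc id breaks ties deterministically.
                ranked = sorted(active, key=lambda d: (running[d], d))
                active -= set(ranked[:n_drop])
    return active

# Eight layers with constant scores: "c" is dropped after layer 4, "b" after layer 8.
layers = [{"a": 0.6, "b": 0.3, "c": 0.1} for _ in range(8)]
assert prune_documents(layers, prune_every=4) == {"a"}
```

Raising `min_keep` is one simple guard against over-pruning: the loop stops dropping documents once only that many remain.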

Optimization Features

Token Efficiency

Maximize positional encoding utilization to fit more windows

System Optimization

Lower prefill complexity via O(L) cache loading

Keep decode work proportional to final cache size

Inference Optimization

Query-independent offline caching to avoid repeated encoding

Layer-adaptive pruning to reduce attended contexts during prefill

Cache re-positioning (RoPE inversion/re-apply) to reuse positional slots

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

Natural Questions (public dataset)

TriviaQA (public dataset)

Risks & Boundaries

Limitations

Requires offline precomputation and storage of per-document caches; costly if documents update frequently.

Evaluated only in zero-shot QA settings and on a limited set of LLMs (LLaMA-2, Qwen2).
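
To gauge the storage cost mentioned in the first limitation, a rough estimate of KV cache size can help. The sketch below uses the standard accounting of one key and one value vector per layer, per key/value head, per token. The shape values (32 layers, 32 key/value heads, head dimension 128, 16-bit elements) are assumptions based on the publicly documented LLaMA-2-7B configuration, not figures from this summary, and the 200-token document length is purely illustrative; check both against your own model and corpus.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for n_tokens across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Assumed LLaMA-2-7B-like shapes with 16-bit storage.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 2**20, "MiB per token")  # 0.5 MiB per token

# Illustrative 200-token document.
per_doc = kv_cache_bytes(200, n_layers=32, n_kv_heads=32, head_dim=128)
print(per_doc / 2**20, "MiB per document")  # 100.0 MiB per document
```

Under these assumptions the cache is far larger than the source text, which is why the method is best suited to stable, frequently queried documents.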

When Not To Use

Your document collection changes rapidly and offline caches would become stale.

You need few-shot in-context examples inside the same windows (paper used zero-shot).

Failure Modes

Stale offline caches when source documents change frequently.

Over-pruning removes relevant documents and reduces answer quality.
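
One way to limit the stale-cache failure mode is to key each stored cache by a hash of the document text, so an edited document misses the cache and is re-encoded. This is a sketch of that idea, not a mechanism described in the paper; `KVCacheStore`, `content_key`, and the `compute_kv` callback (text in, cache object out) are hypothetical names, and an in-memory dict stands in for real storage.

```python
import hashlib

def content_key(text, model_id):
    """Key a cache entry by model and by a hash of the exact document text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_id}:{digest}"

class KVCacheStore:
    def __init__(self, compute_kv, model_id):
        self._compute_kv = compute_kv  # hypothetical: text -> KV cache object
        self._model_id = model_id
        self._store = {}               # stand-in for disk or object storage

    def get(self, text):
        """Return the cached KV object, computing it only on a miss."""
        key = content_key(text, self._model_id)
        if key not in self._store:
            self._store[key] = self._compute_kv(text)
        return self._store[key]

# Usage with a fake encoder that records how often it is called.
calls = []
store = KVCacheStore(lambda t: calls.append(t) or len(t), model_id="demo-model")
store.get("doc v1")
store.get("doc v1")      # served from the store, no recomputation
store.get("doc v2")      # changed text, so the cache is recomputed
assert len(calls) == 2
```

Including the model identifier in the key reflects that caches computed by one model cannot be reused by another.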

Core Entities

Models

LLaMA-2-7B-Chat

Qwen2-1.5B-Instruct

Qwen2-7B-Instruct

Metrics

Accuracy

Recall@K

MRR@100

100-token generation time (prefill/decode/total)

Datasets

Natural Questions

TriviaQA
