Overview
The method requires offline cache storage and per-layer scoring, but it shows concrete latency gains and stable accuracy on two public QA datasets (Natural Questions and TriviaQA). The results are promising, though limited to the reported LLaMA-2 and Qwen2 experiments.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
CacheFocus lowers inference latency while preserving answer quality when LLMs must use many retrieved documents, which cuts compute cost on long-context production queries without retraining.
Summary TLDR
CacheFocus is a no-retraining method for retrieval-augmented generation (RAG) that pre-computes document KV caches offline, shifts cached keys in positional space (re-positioning), prunes low-relevance caches per layer, and reassigns the freed positions. On Natural Questions and TriviaQA with LLaMA-2 and Qwen2 models, it lowers prefill and total inference time and maintains accuracy when inputs exceed the model's maximum length. The approach trades offline storage for faster, more stable decoding on long-context workloads.
Problem Statement
LLMs struggle with very long retrieved contexts: model input-length limits and naive concatenation cause slow pre-filling and accuracy drops beyond the model's max tokens. Existing fixes need extra training or heuristics and can still degrade when many documents are added. The paper asks: can offline caching plus smart cache positioning and pruning extend effective context and cut inference cost without retraining?
Main Contribution
Query-independent offline caching: precompute per-document KV caches and store a shared prefix to avoid recomputing document encodings at query time.
Cache Re-Positioning: convert cached keys back to an unrotated form and re-apply RoPE rotations to place caches at new positional indices, avoiding attention amplification.
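The re-positioning idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the half-split RoPE layout (first half of the head dimension paired with the second half) and a base of 10000, and the helper names `rope_angles`, `rotate`, and `reposition_keys` are hypothetical. Because RoPE is a pure rotation, rotating by the negative of the old angle recovers the unrotated key, which can then be rotated to any new position.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Return RoPE angles of shape (seq_len, head_dim // 2)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

def rotate(x, angles):
    """Rotate each (x1, x2) pair of x by the given angles (half-split layout)."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def reposition_keys(cached_keys, old_positions, new_positions, base=10000.0):
    """Undo the rotation applied at old_positions, then rotate to new_positions."""
    head_dim = cached_keys.shape[-1]
    unrotated = rotate(cached_keys, -rope_angles(old_positions, head_dim, base))
    return rotate(unrotated, rope_angles(new_positions, head_dim, base))

# Toy check: keys cached at positions 0..3 are moved to positions 100..103.
rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 8))          # unrotated keys, head_dim = 8
old_pos = np.arange(4)
new_pos = np.arange(100, 104)
cached = rotate(raw, rope_angles(old_pos, 8))
moved = reposition_keys(cached, old_pos, new_pos)

# Moving a cached key matches encoding the raw key directly at the new position.
assert np.allclose(moved, rotate(raw, rope_angles(new_pos, 8)))
```

Only keys carry the rotation in RoPE, so cached values would be reused unchanged in a sketch like this.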
Key Findings
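CacheFocus cuts total 100-token generation time on long inputs compared to a naive baseline: at 4K input (20 documents) on LLaMA-2-7B-Chat, total time drops from 4.611s to 3.162s with pruning (−31.4%, Table 2).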
CacheFocus preserves model accuracy when inputs exceed the model's official max length, unlike naive RAG which degrades.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| 100-token generation time (total) | 4K input (20 docs): naive 4.611s; CacheFocus w/ prune 3.162s | naive 4.611s | −1.449s (−31.4%) | LLaMA-2-7B-Chat, Table 2 | Table 2 reports prefill/decode/total times for naive and cached/pruned settings | Table 2 |
| BM25 retrieval R@5 | 0.2500 -> 0.2800 | BM25 0.2500 | +0.0300 (absolute) | NQ, Qwen2-1.5B-Instruct, Table 1 | Table 1 shows R@K under baseline and pruning/positional strategies | Table 1 |
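
The latency delta in the first row can be rechecked from the two reported totals. The snippet below only reproduces that arithmetic; the variable names are illustrative.

```python
# Totals reported for LLaMA-2-7B-Chat at 4K input with 20 documents (Table 2).
naive_total_s = 4.611
cachefocus_total_s = 3.162

abs_delta = cachefocus_total_s - naive_total_s   # seconds saved (negative = faster)
rel_delta = abs_delta / naive_total_s            # relative to the naive baseline

print(f"{abs_delta:+.3f}s ({rel_delta:+.1%})")   # -1.449s (-31.4%)
```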
What To Try In 7 Days
Precompute and store document KV caches for stable documents and measure prefill time savings.
Add a simple per-layer attention aggregator and prune lowest-scoring document caches every few layers (n=4 used here).
Implement RoPE-based re-positioning of cached keys to reuse positional slots and avoid attention amplification; compare latency and accuracy against your current RAG pipeline.
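As a starting point for the pruning step above, here is a toy sketch of a per-layer attention aggregator with periodic pruning. It rests on several assumptions that are not taken from the paper: the score for a document is the attention mass it receives, averaged over heads and query tokens and accumulated across layers; `n` is read as the layer interval between pruning steps; and one document is dropped per step. The function names are hypothetical, and the attention tensors are precomputed here, whereas in a real model pruning would change the attention of later layers.

```python
import numpy as np

def document_scores(attn, doc_spans):
    """attn: (heads, q_len, kv_len) weights; doc_spans: {doc_id: (start, end)}."""
    per_token = attn.mean(axis=(0, 1))  # average over heads and query tokens
    return {doc: float(per_token[s:e].sum()) for doc, (s, e) in doc_spans.items()}

def prune_schedule(layer_attn, doc_spans, every_n=4, drop_per_step=1):
    """Accumulate scores layer by layer; every `every_n` layers drop the weakest docs."""
    active = dict(doc_spans)
    running = {doc: 0.0 for doc in active}
    for layer_idx, attn in enumerate(layer_attn, start=1):
        for doc, score in document_scores(attn, active).items():
            running[doc] += score
        if layer_idx % every_n == 0 and len(active) > drop_per_step:
            ranked = sorted(active, key=running.get, reverse=True)
            keep = ranked[: len(active) - drop_per_step]
            active = {doc: active[doc] for doc in keep}
    return list(active)

# Toy input: 3 documents of 2 tokens each; document "b" receives the least attention.
spans = {"a": (0, 2), "b": (2, 4), "c": (4, 6)}
weights = np.array([0.3, 0.3, 0.05, 0.05, 0.15, 0.15])
attn = np.tile(weights, (2, 1, 1))      # (heads=2, q_len=1, kv_len=6)

survivors = prune_schedule([attn] * 4, spans, every_n=4)
print(sorted(survivors))                # ['a', 'c']
```

The positions freed by a dropped document could then be reassigned to the remaining caches with a RoPE re-rotation; the drop rate and interval are knobs to tune against answer quality.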
Risks & Boundaries
Limitations
Requires offline precomputation and storage of per-document caches; costly if documents update frequently.
Evaluated only in zero-shot QA settings and on a limited set of LLMs (LLaMA-2, Qwen2).
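To gauge the storage limitation before committing to offline caching, a back-of-the-envelope estimate helps. The sketch below uses the LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128) with fp16 values; the 200-token passage length is a hypothetical figure chosen for illustration, and `kv_cache_bytes` is not a library function.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# LLaMA-2-7B shape in fp16; passage length is an assumption.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
per_doc = kv_cache_bytes(200, num_layers=32, num_kv_heads=32, head_dim=128)

print(f"{per_token / 2**20:.1f} MiB per token")     # 0.5 MiB per token
print(f"{per_doc / 2**20:.1f} MiB per document")    # 100.0 MiB per document
```

Multiplying the per-document figure by corpus size gives a first estimate of the disk budget; models with fewer KV heads or quantized caches would need proportionally less.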
When Not To Use
Your document collection changes rapidly and offline caches would become stale.
You need few-shot in-context examples inside the same context window (the paper evaluates zero-shot only).
Failure Modes
Stale offline caches when source documents change frequently.
Over-pruning removes relevant documents and reduces answer quality.

