Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
CacheFocus lowers inference latency while preserving answer quality when an LLM must read many retrieved documents, which cuts compute cost on long-context production queries without retraining.
Summary TLDR
CacheFocus is a no-retraining method for retrieval-augmented generation. It pre-computes document KV caches offline, moves cached keys to new positional indices (re-positioning), prunes low-relevance caches layer by layer, and reassigns the positions that pruning frees up. On Natural Questions and TriviaQA with LLaMA-2 and Qwen2 models, it lowers prefill and total inference time and keeps accuracy stable when inputs exceed the model's maximum length. The approach trades offline storage for faster, more stable decoding on long-context workloads.
Problem Statement
LLMs struggle with very long retrieved contexts: input-length limits and naive concatenation make pre-filling slow, and accuracy drops once the input exceeds the model's maximum length. Existing fixes need extra training or heuristics and can still degrade when many documents are added. The paper asks: can offline caching plus smart cache positioning and pruning extend effective context and cut inference cost without retraining?
Main Contribution
Query-independent offline caching: precompute per-document KV caches and store a shared prefix to avoid recomputing document encodings at query time.
Cache Re-Positioning: convert cached keys back to an unrotated form and re-apply RoPE rotations to place caches at new positional indices, avoiding attention amplification.
Layer-Adaptive Cache Pruning: accumulate document-level attention scores across layers and prune low-relevance caches during pre-filling to reduce noise.
Adaptive Positional Allocation: after pruning, reassign freed positional slots either by simple alignment ('align') or by attention-guided reordering ('sort') to maximize encoding-space use.
Empirical evaluation: shows lower prefill and total latency and stable accuracy across enlarged contexts on NQ and TQA with LLaMA-2 and Qwen2 models.
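The Cache Re-Positioning idea listed above (undo the stored rotation, then rotate to a new index) can be illustrated with a minimal sketch. It assumes the interleaved-pair RoPE convention and a base of 10000; real model implementations may pair dimensions differently, and the function names `rope_rotate` and `reposition_key` are hypothetical, not taken from the paper.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by position * base**(-i/d).

    Assumes the interleaved-pair RoPE layout; `i` steps over even indices.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def reposition_key(cached_key, old_pos, new_pos):
    """Move a cached key from old_pos to new_pos without re-encoding."""
    unrotated = rope_rotate(cached_key, -old_pos)  # invert the stored rotation
    return rope_rotate(unrotated, new_pos)         # re-apply at the new index

raw_key = [1.0, 0.0, 0.5, -0.5]
cached = rope_rotate(raw_key, 7)        # key as stored in the offline cache
moved = reposition_key(cached, 7, 3)    # same key placed at position 3
fresh = rope_rotate(raw_key, 3)         # what direct encoding at 3 would give
```

Because rotations compose, `moved` matches `fresh` up to floating-point error, which is why a cached key can be relocated without running the model again.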
Key Findings
CacheFocus cuts total 100-token generation time on long inputs compared to a naive baseline.
CacheFocus preserves model accuracy when inputs exceed the model's official max length, whereas naive RAG degrades.
Layer-adaptive pruning prevents accuracy decline when many retrieved documents become noisy.
Adaptive positional allocation improves retrieval re-ranking for a weak lexical retriever.
Results
100-token generation time (total)
BM25 retrieval R@5
Retriever quality (DPR) R@5
Who Should Care
What To Try In 7 Days
Precompute and store document KV caches for stable documents and measure prefill time savings.
Add a simple per-layer attention aggregator and prune the lowest-scoring document caches every n layers (the paper uses n=4).
Implement RoPE-based re-positioning of cached keys to reuse positional slots and avoid attention amplification; compare latency and accuracy against your current RAG pipeline.
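For the pruning step in the list above, a minimal sketch of the control flow might look like the following. It assumes you can already extract one attention mass per document per layer; the input format, the function name `layer_adaptive_prune`, and the `drop_per_step` and `min_keep` parameters are hypothetical choices for illustration, not the paper's interface.

```python
def layer_adaptive_prune(layer_scores, every=4, drop_per_step=1, min_keep=1):
    """Accumulate per-document attention across layers and prune periodically.

    layer_scores: list (one entry per layer) of {doc_id: attention mass}.
    Returns the surviving doc ids and the accumulated scores.
    """
    active = set(layer_scores[0])
    cumulative = {doc: 0.0 for doc in active}
    for layer_idx, scores in enumerate(layer_scores, start=1):
        # Only documents still in the cache contribute at this layer.
        for doc in active:
            cumulative[doc] += scores.get(doc, 0.0)
        if layer_idx % every == 0:
            n_drop = min(drop_per_step, len(active) - min_keep)
            if n_drop > 0:
                # Lowest accumulated score goes first; doc id breaks ties.
                ranked = sorted(active, key=lambda d: (cumulative[d], d))
                for doc in ranked[:n_drop]:
                    active.discard(doc)
    return sorted(active), cumulative

# Toy input: 8 layers with the same attention split over three documents.
toy_layers = [{"a": 0.5, "b": 0.3, "c": 0.2}] * 8
kept, totals = layer_adaptive_prune(toy_layers, every=4)
```

With this toy input, document `c` is dropped after layer 4 and `b` after layer 8. Tuning `every` and `drop_per_step` is where the over-pruning risk noted under Failure Modes would show up.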
Optimization Features
Token Efficiency
- Maximize positional encoding utilization to fit more windows
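One simple reading of Adaptive Positional Allocation is sketched below: after pruning, surviving documents are packed into contiguous position ranges right after the shared prefix. Several things here are assumptions rather than details from the paper: the function name `allocate_positions`, giving each document its own range, and, for the 'sort' mode, placing higher-scoring documents later (closer to the query).

```python
def allocate_positions(docs, prefix_len, mode="align"):
    """Assign [start, end) position ranges to surviving documents.

    docs: list of (doc_id, n_tokens, attention_score) in retrieval order.
    mode: "align" keeps retrieval order; "sort" orders by ascending score
          (an assumed direction, so the top-scoring doc ends up last).
    """
    if mode == "align":
        ordered = list(docs)
    elif mode == "sort":
        ordered = sorted(docs, key=lambda d: d[2])
    else:
        raise ValueError(f"unknown mode: {mode}")
    slots = {}
    start = prefix_len
    for doc_id, n_tokens, _score in ordered:
        slots[doc_id] = (start, start + n_tokens)
        start += n_tokens  # no gaps: freed positions are reused immediately
    return slots

# Survivors after pruning (d2 was removed); the shared prefix is 10 tokens.
survivors = [("d1", 4, 0.9), ("d3", 3, 0.2), ("d4", 5, 0.5)]
aligned = allocate_positions(survivors, prefix_len=10, mode="align")
by_score = allocate_positions(survivors, prefix_len=10, mode="sort")
```

Both modes end at position 22 here, so the pruned document's slots are fully reclaimed; only the order of the survivors differs.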
System Optimization
- Lower prefill complexity via O(L) cache loading
- Keep decode work proportional to final cache size
Inference Optimization
- Query-independent offline caching to avoid repeated encoding
- Layer-adaptive pruning to reduce attended contexts during prefill
- Cache re-positioning (RoPE inversion/re-apply) to reuse positional slots
Reproducibility
Data Urls
- Natural Questions (public dataset)
- TriviaQA (public dataset)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires offline precomputation and storage of per-document caches; costly if documents update frequently.
- Evaluated only in zero-shot QA settings and on a limited set of LLMs (LLaMA-2, Qwen2).
- Relies on document segmentation into short passages; not ideal for single very long documents that cannot be split.
- Pruning decisions could remove useful context if attention scores mis-rank relevance.
When Not To Use
- Your document collection changes rapidly and offline caches would become stale.
- You need few-shot in-context examples inside the same windows (paper used zero-shot).
- Data cannot be reasonably split into short passages for caching.
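If your collection changes only occasionally, one way to judge whether offline caching is still viable is to track which caches have gone stale. The sketch below is an assumption about how that could be done (content hashing with a manifest); it is not described in the paper, and `fingerprint` and `find_stale` are hypothetical helper names.

```python
import hashlib

def fingerprint(text):
    """Stable content hash used to detect edited documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(current_docs, manifest):
    """Return ids whose cache is missing or was built from older text.

    current_docs: {doc_id: current text}
    manifest:     {doc_id: fingerprint recorded when the cache was built}
    """
    return sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if manifest.get(doc_id) != fingerprint(text)
    )

manifest = {"a": fingerprint("alpha"), "b": fingerprint("beta")}
current = {"a": "alpha", "b": "beta v2", "c": "gamma"}
stale = find_stale(current, manifest)
```

Here `b` (edited) and `c` (new) need their KV caches rebuilt. If the stale fraction per day is large, the offline precomputation cost may outweigh the prefill savings.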
Failure Modes
- Stale offline caches when source documents change frequently.
- Over-pruning removes relevant documents and reduces answer quality.
- Excessive reuse of positional slots can still amplify attention and degrade accuracy if re-positioning is misapplied.
- Memory overhead from storing large per-document KV caches.
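The storage cost behind the memory-overhead item can be estimated with the standard KV cache size formula. The sketch below plugs in the public LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128, fp16); those values and the 100-token passage length are assumptions brought in for illustration and do not come from this summary.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for n_tokens.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Assumed LLaMA-2-7B shape in fp16 (2 bytes per value).
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, head_dim=128)
per_passage = kv_cache_bytes(100, n_layers=32, n_kv_heads=32, head_dim=128)
per_passage_mib = per_passage / (1024 ** 2)
```

Under these assumptions each token costs 0.5 MiB and a 100-token passage costs 50 MiB, so a large corpus needs a storage budget planned up front.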
Core Entities
Models
- LLaMA-2-7B-Chat
- Qwen2-1.5B-Instruct
- Qwen2-7B-Instruct
Metrics
- Accuracy
- Recall@K
- MRR@100
- 100-token generation time (prefill/decode/total)
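For reference, the two retrieval metrics above can be computed per query as in this sketch. It assumes the hit-style definition of Recall@K common in open-domain QA (1 if any relevant document appears in the top K); the summary does not state which definition the paper uses, and the function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Hit-style Recall@K: 1.0 if any relevant doc is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids, cutoff=100):
    """1/rank of the first relevant doc within the cutoff, else 0.0."""
    for rank, doc in enumerate(ranked_ids[:cutoff], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean(values):
    """Average per-query scores to get the dataset-level metric."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d9"}
r_at_2 = recall_at_k(ranked, relevant, k=2)
r_at_3 = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
```

Averaging `reciprocal_rank` over all queries with `mean` gives MRR@100; averaging `recall_at_k` gives Recall@K.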
Datasets
- Natural Questions
- TriviaQA

