Overview
The method is low-cost and tested across three models and three datasets, showing strong ASR drops and preserved quality; access to KV cache and known context boundaries are required.
Citations0
Evidence Strength0.80
Confidence0.90
Risk Signals10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 75%
Production readiness: 60%
Novelty: 55%
Why It Matters For Business
CachePrune reduces indirect prompt-injection risk with minimal compute and no change to prompts or extra LLM calls, protecting production LLM apps while keeping answer quality.
Who Should Care
Summary TLDR
CachePrune finds neurons in a prompt's transformer KV cache that make the model treat context as instructions, then masks (prunes) those neurons so context is used only as data. It needs only a few samples (default N=8) and prunes a tiny fraction of neurons (default p=0.5%). On tested models and QA datasets it cuts attack success rates from tens of percent to low single digits while keeping answer quality nearly unchanged, and it does not require extra formatting or test-time LLM calls.
Problem Statement
LLMs can mistake context text for instructions and follow injected tasks (indirect prompt injection). Existing fixes either retrain models (costly) or change prompts/workflows (extra computation or worse quality). We need a lightweight defense that keeps original prompts and inference flow.
Main Contribution
CachePrune: identify and mask neurons in the context KV cache that trigger instruction-following.
A preferential attribution loss and selective thresholding to find task-triggering neurons with few samples and preserve answer quality.
Key Findings
CachePrune cuts attack success on LLaMA3-8B (SQuAD) from ~27.86% to ~7.44%.
On Mistral-7B (SQuAD) CachePrune reduces ASR to under 1%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Attack Success Rate (ASR) | 7.44% ± 0.22 | Vanilla 27.86% | -20.42 pp | SQuAD (LLaMA3-8B) | Table 1 (SQuAD, LLaMA3-8B) | Table 1 |
| ASR | 0.68% ± 0.41 | Vanilla 9.01% | -8.33 pp | SQuAD (Mistral-7B) | Table 1 (SQuAD, Mistral-7B) | Table 1 |
What To Try In 7 Days
Run CachePrune on a copy of your cached contexts with N=8 and p=0.5% to measure ASR and quality.
Compare model outputs pre/post-mask on a small holdout of real prompts to verify no quality drop.
If you use cached contexts across many queries, apply the learned mask once and reuse it to avoid per-request overhead.
Agent Features
Memory
Tool Use
Architectures
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Requires access to the model's KV cache and the exact token indices of the context span.
Does not compare directly to heavy training-based defenses (finetuning) in compute-cost tradeoffs.
When Not To Use
You cannot access or modify the model's KV cache (closed API, no activation access).
You can afford large-scale finetuning or specialized training defenses and prefer training-time fixes.
Failure Modes
Pruning a large fraction of neurons degrades clean-task quality.
Adaptive attackers could potentially find token sequences that re-trigger poisoned outputs in some models.

