Prune KV-cache neurons to stop indirect prompt-injection without extra LLM calls

Overview

Decision SnapshotReady For Pilot

The method is low-cost and tested across three models and three datasets, showing strong ASR drops and preserved quality; access to KV cache and known context boundaries are required.

Citations0

Evidence Strength0.80

Confidence0.90

Risk Signals10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 55%

Authors

Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Subrata Mitra, Lina Yao, Julian McAuley

Links

Abstract / PDF

Why It Matters For Business

CachePrune reduces indirect prompt-injection risk with minimal compute and no change to prompts or extra LLM calls, protecting production LLM apps while keeping answer quality.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist

Summary TLDR

CachePrune finds neurons in a prompt's transformer KV cache that make the model treat context as instructions, then masks (prunes) those neurons so context is used only as data. It needs only a few samples (default N=8) and prunes a tiny fraction of neurons (default p=0.5%). On tested models and QA datasets it cuts attack success rates from tens of percent to low single digits while keeping answer quality nearly unchanged, and it does not require extra formatting or test-time LLM calls.

Problem Statement

LLMs can mistake context text for instructions and follow injected tasks (indirect prompt injection). Existing fixes either retrain models (costly) or change prompts/workflows (extra computation or worse quality). We need a lightweight defense that keeps original prompts and inference flow.

Main Contribution

CachePrune: identify and mask neurons in the context KV cache that trigger instruction-following.

A preferential attribution loss and selective thresholding to find task-triggering neurons with few samples and preserve answer quality.

Key Findings

CachePrune cuts attack success on LLaMA3-8B (SQuAD) from ~27.86% to ~7.44%.

Numbers27.86% → 7.44% (Table 1, SQuAD LLaMA3-8B)

Practical UseIf you run CachePrune on cached contexts for LLaMA3-8B, expect a large drop in indirect-injection success without reformatting prompts.

Evidence RefTable 1 (SQuAD, LLaMA3-8B)

On Mistral-7B (SQuAD) CachePrune reduces ASR to under 1%.

Numbers9.01% → 0.68% (Table 1, SQuAD Mistral-7B)

Practical UseFor some models, pruning a small neuron subset nearly eliminates indirect-injection on tested QA data.

Evidence RefTable 1 (SQuAD, Mistral-7B)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Attack Success Rate (ASR)	7.44% ± 0.22	Vanilla 27.86%	-20.42 pp	SQuAD (LLaMA3-8B)	Table 1 (SQuAD, LLaMA3-8B)	Table 1
ASR	0.68% ± 0.41	Vanilla 9.01%	-8.33 pp	SQuAD (Mistral-7B)	Table 1 (SQuAD, Mistral-7B)	Table 1

What To Try In 7 Days

Run CachePrune on a copy of your cached contexts with N=8 and p=0.5% to measure ASR and quality.

Compare model outputs pre/post-mask on a small holdout of real prompts to verify no quality drop.

If you use cached contexts across many queries, apply the learned mask once and reuse it to avoid per-request overhead.

Agent Features

Memory

KV cache intervention

Tool Use

KV cache pruning

Architectures

Transformer (KV cache)

Optimization Features

Token Efficiency

mask applied to cached contexts once; no per-response token overhead

Model Optimization

prune neuron activations in KV cache

System Optimization

compatible with context caching to avoid repeated work

Training Optimization

preferential attribution loss for sample-efficient attribution

Inference Optimization

no extra LLM calls or prompt formatting at test time

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusUnknown

LicenseUnknown

Risks & Boundaries

Limitations

Requires access to the model's KV cache and the exact token indices of the context span.

Does not compare directly to heavy training-based defenses (finetuning) in compute-cost tradeoffs.

When Not To Use

You cannot access or modify the model's KV cache (closed API, no activation access).

You can afford large-scale finetuning or specialized training defenses and prefer training-time fixes.

Failure Modes

Pruning a large fraction of neurons degrades clean-task quality.

Adaptive attackers could potentially find token sequences that re-trigger poisoned outputs in some models.

Core Entities

Models

LLaMA3-8BMistral-7B-Instruct-V3.0Phi-3.5-mini-instruct

Metrics

ASRF1ROUGEBERTScoreGPT-Score

Datasets

SQuADHotpotQAWildChat

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

CachePrune cuts attack success on LLaMA3-8B (SQuAD) from ~27.86% to ~7.44%.

On Mistral-7B (SQuAD) CachePrune reduces ASR to under 1%.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

You May Also Want to Read

Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

Key finding

BackdoorAgent: a stage-aware framework and benchmark showing memory backdoors persist across multi-step LLM agents

Key finding

JudgeDeceiver: automatically craft prompts that reliably trick LLM-as-a-Judge to pick an attacker’s response

Key finding

Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

Key finding

A systematic, practitioner-focused map of 193 multi-agent security threats and how 16 frameworks cover them

Key finding