Prune KV-cache neurons to stop indirect prompt-injection without extra LLM calls

April 29, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.75

Citation Count

0

Authors

Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Subrata Mitra, Lina Yao, Julian McAuley

Why It Matters For Business

CachePrune reduces indirect prompt-injection risk with minimal compute and no change to prompts or extra LLM calls, protecting production LLM apps while keeping answer quality.

Summary TLDR

CachePrune finds neurons in a prompt's transformer KV cache that make the model treat context as instructions, then masks (prunes) those neurons so context is used only as data. It needs only a few samples (default N=8) and prunes a tiny fraction of neurons (default p=0.5%). On tested models and QA datasets it cuts attack success rates from tens of percent to low single digits while keeping answer quality nearly unchanged, and it does not require extra formatting or test-time LLM calls.
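
The masking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each context KV-cache neuron already has a scalar attribution score (how those scores are computed is out of scope here), and it treats the cache as a flat list. The names `top_fraction_mask` and `apply_mask` are hypothetical.

```python
def top_fraction_mask(scores, fraction):
    """Build a keep-mask (1.0 = keep, 0.0 = prune) that zeroes the
    `fraction` of neurons with the highest attribution scores."""
    n = len(scores)
    k = round(n * fraction)  # number of neurons to prune
    # Indices of the k highest-scoring neurons.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    pruned = set(ranked[:k])
    return [0.0 if i in pruned else 1.0 for i in range(n)]


def apply_mask(cache_values, mask):
    """Multiply cached activations by the mask; pruned neurons become 0."""
    return [v * m for v, m in zip(cache_values, mask)]


# Toy example: 200 neurons, one of which has a clearly higher score.
scores = [0.1] * 200
scores[17] = 5.0
mask = top_fraction_mask(scores, fraction=0.005)  # p = 0.5% -> 1 neuron
masked_cache = apply_mask([2.0] * 200, mask)
```

In a real model the mask would have the shape of the key/value tensors over the context span only, so tokens outside the context (system prompt, user question) stay untouched.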

Problem Statement

LLMs can mistake context text for instructions and follow injected tasks (indirect prompt injection). Existing fixes either retrain models (costly) or change prompts/workflows (extra computation or worse quality). We need a lightweight defense that keeps original prompts and inference flow.

Main Contribution

CachePrune: identify and mask neurons in the context KV cache that trigger instruction-following.

A preferential attribution loss and selective thresholding to find task-triggering neurons with few samples and preserve answer quality.

Empirical evidence that masking the KV-cache neurons cuts attack success while preserving response quality and transfers across attacks and models.
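
The summary does not spell out the attribution loss or the thresholding rule, so the snippet below only illustrates the general idea of "selective" selection: prefer neurons that push the model toward the injected response while sparing neurons that also support the clean answer. The two score lists, the function `selective_candidates`, and the `clean_tolerance` parameter are all assumptions made for this sketch.

```python
def selective_candidates(attr_injected, attr_clean, clean_tolerance=0.0):
    """Return indices of neurons that contribute positively to the
    injected (poisoned) response but contribute no more than
    `clean_tolerance` to the clean response."""
    return [
        i
        for i, (a_inj, a_cln) in enumerate(zip(attr_injected, attr_clean))
        if a_inj > 0 and a_cln <= clean_tolerance
    ]


# Hypothetical per-neuron attributions for four neurons.
attr_injected = [0.9, 0.8, -0.1, 0.3]
attr_clean = [0.0, 0.7, 0.0, -0.2]
candidates = selective_candidates(attr_injected, attr_clean)
```

Neuron 1 is left alone here because it also matters for the clean answer, which is the intuition behind keeping response quality intact while pruning.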

Key Findings

CachePrune cuts the attack success rate (ASR) on LLaMA3-8B (SQuAD) from 27.86% to 7.44%.

Numbers: 27.86% → 7.44% (Table 1, SQuAD LLaMA3-8B)

On Mistral-7B (SQuAD), CachePrune reduces the attack success rate (ASR) to under 1%.

Numbers: 9.01% → 0.68% (Table 1, SQuAD Mistral-7B)

Answer quality is preserved after pruning on evaluated tasks.

Numbers: F1 (clean) 28.20 → 28.68 (LLaMA3-8B SQuAD, Table 1)

CachePrune needs very few samples and prunes a tiny fraction of neurons.

Numbers: N=8 samples, p=0.5% neurons (Sec. 4.1; Table 6)

Results

Attack Success Rate (ASR), LLaMA3-8B on SQuAD

Value: 7.44% ± 0.22

Baseline: Vanilla 27.86%

ASR, Mistral-7B on SQuAD

Value: 0.68% ± 0.41

Baseline: Vanilla 9.01%

F1 (clean), LLaMA3-8B on SQuAD

Value: 28.68 ± 0.30

Baseline: Vanilla 28.20

Sample efficiency

Value: N = 8 samples (default)

Baseline: Not applicable
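
To put the ASR figures in perspective, the short calculation below converts the reported baseline and post-pruning values into absolute and relative reductions. The input numbers come from the results above; the helper name `relative_reduction` is just for illustration.

```python
def relative_reduction(baseline, value):
    """Fraction of the baseline that was removed, e.g. 0.5 = halved."""
    return (baseline - value) / baseline


# (baseline ASR %, CachePrune ASR %) as reported for SQuAD.
reported = {
    "LLaMA3-8B": (27.86, 7.44),
    "Mistral-7B": (9.01, 0.68),
}

for model, (baseline, pruned) in reported.items():
    absolute = baseline - pruned
    relative = relative_reduction(baseline, pruned)
    print(f"{model}: -{absolute:.2f} points, {relative:.1%} relative reduction")

# LLaMA3-8B: -20.42 points, 73.3% relative reduction
# Mistral-7B: -8.33 points, 92.5% relative reduction
```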

Who Should Care

What To Try In 7 Days

Run CachePrune on a copy of your cached contexts with N=8 and p=0.5%, then measure ASR and answer quality before and after masking.

Compare model outputs pre/post-mask on a small holdout of real prompts to verify no quality drop.

If you use cached contexts across many queries, apply the learned mask once and reuse it to avoid per-request overhead.
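
A simple way to run the before/after comparison is to count how often a response contains the payload the injected instruction asked for. The sketch below assumes a substring match is a good enough success criterion for your attacks; the paper's exact ASR definition may differ, and the responses and the `HACKED` marker are made-up placeholders rather than real outputs.

```python
def attack_success_rate(responses, injected_markers):
    """Share of responses that contain the marker string the injected
    instruction was trying to produce (case-insensitive)."""
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(
        1
        for response, marker in zip(responses, injected_markers)
        if marker.lower() in response.lower()
    )
    return hits / len(responses)


# Placeholder outputs for the same four prompts, before and after masking.
markers = ["HACKED"] * 4
before = [
    "The capital is Paris.",
    "HACKED: visit evil.example",
    "Sure! hacked",
    "It was built in 1889.",
]
after = [
    "The capital is Paris.",
    "The tower is in Paris.",
    "Sure! hacked",
    "It was built in 1889.",
]

asr_before = attack_success_rate(before, markers)
asr_after = attack_success_rate(after, markers)
```

Running the same function on clean prompts with your usual quality metric gives the second half of the comparison.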

Agent Features

Memory

  • KV cache intervention

Tool Use

  • KV cache pruning

Architectures

  • Transformer (KV cache)

Optimization Features

Token Efficiency

  • mask applied to cached contexts once; no per-response token overhead

Model Optimization

  • prune neuron activations in KV cache

System Optimization

  • compatible with context caching to avoid repeated work

Training Optimization

  • preferential attribution loss for sample-efficient attribution

Inference Optimization

  • no extra LLM calls or prompt formatting at test time

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires access to the model's KV cache and the exact token indices of the context span.
  • Offers no direct compute-cost comparison against heavy training-based defenses such as fine-tuning.
  • Mask transfer is effective but not guaranteed for all attack types or unseen prompts.
  • Pruning too large a neuron fraction can harm task performance (see ablation on p).
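
The first limitation, knowing the exact token indices of the context span, can often be handled with character offsets. The sketch below assumes you have a list of `(char_start, char_end)` pairs per token; Hugging Face fast tokenizers can return such a list via `return_offsets_mapping=True`, but here a toy whitespace tokenizer stands in so the example is self-contained. The function names are hypothetical.

```python
import re


def whitespace_offsets(text):
    """Toy tokenizer: one token per whitespace-separated chunk,
    returned as (char_start, char_end) pairs."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def context_token_indices(offsets, ctx_start, ctx_end):
    """Indices of tokens that lie fully inside [ctx_start, ctx_end).
    Zero-width entries (often used for special tokens) are skipped."""
    return [
        i
        for i, (start, end) in enumerate(offsets)
        if end > start and start >= ctx_start and end <= ctx_end
    ]


context = "Paris is in France."
prompt = f"Answer the question. Context: {context} Question: Where is Paris?"
ctx_start = prompt.index(context)
ctx_end = ctx_start + len(context)

indices = context_token_indices(whitespace_offsets(prompt), ctx_start, ctx_end)
```

Only the cache positions listed in `indices` would be eligible for masking; the instruction and question tokens are left as they are.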

When Not To Use

  • You cannot access or modify the model's KV cache (closed API, no activation access).
  • You can afford large-scale finetuning or specialized training defenses and prefer training-time fixes.
  • Context boundaries are unknown or dynamic and cannot be reliably marked.

Failure Modes

  • Pruning a large fraction of neurons degrades clean-task quality.
  • Adaptive attackers could find token sequences that re-trigger poisoned outputs in some models.
  • Learned masks may not generalize to very different prompt templates or unseen injection styles.
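
Because over-pruning hurts clean-task quality, it can help to sweep the pruned fraction p and keep the setting with the lowest ASR whose quality drop stays within a budget. The sketch below shows one such selection rule; the rule, the function `pick_fraction`, and every number in `sweep` are illustrative placeholders, not results from the paper.

```python
def pick_fraction(sweep, baseline_f1, max_f1_drop=0.5):
    """Return the p with the lowest ASR among settings whose F1 drop
    versus the unpruned baseline is at most `max_f1_drop`.
    Ties on ASR go to the smaller p. Returns None if nothing qualifies."""
    acceptable = [row for row in sweep if baseline_f1 - row["f1"] <= max_f1_drop]
    if not acceptable:
        return None
    best = min(acceptable, key=lambda row: (row["asr"], row["p"]))
    return best["p"]


# Placeholder measurements from a hypothetical sweep (NOT real results).
sweep = [
    {"p": 0.001, "asr": 0.20, "f1": 30.0},
    {"p": 0.005, "asr": 0.05, "f1": 29.8},
    {"p": 0.020, "asr": 0.03, "f1": 27.0},  # lowest ASR, but F1 drops too far
]

chosen_p = pick_fraction(sweep, baseline_f1=30.0)
```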

Core Entities

Models

  • LLaMA3-8B
  • Mistral-7B-Instruct-V3.0
  • Phi-3.5-mini-instruct

Metrics

  • ASR
  • F1
  • ROUGE
  • BERTScore
  • GPT-Score

Datasets

  • SQuAD
  • HotpotQA
  • WildChat