Prune KV-cache neurons to stop indirect prompt-injection without extra LLM calls

April 29, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is low-cost and was tested across three models and three datasets, showing strong drops in attack success rate (ASR) with preserved answer quality. It requires access to the KV cache and known context boundaries.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 55%

Authors

Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Subrata Mitra, Lina Yao, Julian McAuley

Links

Abstract / PDF

Why It Matters For Business

CachePrune reduces indirect prompt-injection risk with minimal compute and no change to prompts or extra LLM calls, protecting production LLM apps while keeping answer quality.

Summary TLDR

CachePrune finds neurons in a prompt's transformer KV cache that make the model treat context as instructions, then masks (prunes) those neurons so context is used only as data. It needs only a few samples (default N=8) and prunes a tiny fraction of neurons (default p=0.5%). On tested models and QA datasets it cuts attack success rates from tens of percent to low single digits while keeping answer quality nearly unchanged, and it does not require extra formatting or test-time LLM calls.
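The masking step described above can be illustrated with a minimal sketch. It assumes attribution scores have already been computed (this summary does not give the scoring formula, so `scores` is a hypothetical input), and it treats a "neuron" as one feature dimension of a cached key or value vector, which is also an assumption. The function names are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def top_fraction_mask(scores: Sequence[float], fraction: float) -> List[int]:
    """Return a 0/1 keep-mask that zeroes the highest-scoring `fraction` of neurons."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    # Number of neurons to prune, rounded to the nearest integer.
    k = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pruned = set(ranked[:k])
    return [0 if i in pruned else 1 for i in range(len(scores))]


def apply_mask_to_context(
    cache: Sequence[Sequence[float]],
    mask: Sequence[int],
    context_start: int,
    context_end: int,
) -> List[List[float]]:
    """Mask only the rows (token positions) inside [context_start, context_end).

    `cache` is a toy stand-in for one layer of a KV cache:
    one row per token position, one column per neuron.
    """
    return [
        [a * m for a, m in zip(row, mask)]
        if context_start <= t < context_end
        else list(row)
        for t, row in enumerate(cache)
    ]


# Toy example: 200 neurons, prune p = 0.5% (one neuron).
scores = [i * 0.01 for i in range(200)]        # hypothetical attribution scores
mask = top_fraction_mask(scores, 0.005)
cache = [[1.0] * 200 for _ in range(3)]        # 3 token positions
masked = apply_mask_to_context(cache, mask, context_start=1, context_end=3)
```

Tokens outside the context span (here position 0) are left untouched, which reflects the idea that only the context portion of the cache is modified.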

Problem Statement

LLMs can mistake context text for instructions and follow injected tasks (indirect prompt injection). Existing fixes either retrain models (costly) or change prompts/workflows (extra computation or worse quality). We need a lightweight defense that keeps original prompts and inference flow.

Main Contribution

CachePrune: identify and mask neurons in the context KV cache that trigger instruction-following.

A preferential attribution loss and selective thresholding to find task-triggering neurons with few samples and preserve answer quality.

Key Findings

CachePrune cuts attack success on LLaMA3-8B (SQuAD) from 27.86% to 7.44%.

Numbers: 27.86% → 7.44% (Table 1, SQuAD LLaMA3-8B)

Practical Use: If you run CachePrune on cached contexts for LLaMA3-8B, expect a large drop in indirect-injection success without reformatting prompts.

Evidence Ref: Table 1 (SQuAD, LLaMA3-8B)

On Mistral-7B (SQuAD) CachePrune reduces ASR to under 1%.

Numbers: 9.01% → 0.68% (Table 1, SQuAD Mistral-7B)

Practical Use: For some models, pruning a small neuron subset nearly eliminates indirect-injection on tested QA data.

Evidence Ref: Table 1 (SQuAD, Mistral-7B)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Attack Success Rate (ASR) | 7.44% ± 0.22 | Vanilla 27.86% | -20.42 pp | SQuAD (LLaMA3-8B) | Table 1 (SQuAD, LLaMA3-8B) | Table 1
ASR | 0.68% ± 0.41 | Vanilla 9.01% | -8.33 pp | SQuAD (Mistral-7B) | Table 1 (SQuAD, Mistral-7B) | Table 1
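The Delta column is the absolute change in percentage points (pp). As a quick check of that arithmetic, the sketch below recomputes the absolute drop and also derives the relative reduction, which the table does not report. The helper name `asr_reduction` is illustrative.

```python
from typing import Tuple


def asr_reduction(baseline_pct: float, defended_pct: float) -> Tuple[float, float]:
    """Return (absolute drop in pp, relative drop in % of the baseline)."""
    absolute_pp = baseline_pct - defended_pct
    relative_pct = 100.0 * absolute_pp / baseline_pct
    return round(absolute_pp, 2), round(relative_pct, 1)


# (baseline ASR, ASR with CachePrune) taken from the Results rows above.
rows = {
    "LLaMA3-8B (SQuAD)": (27.86, 7.44),
    "Mistral-7B (SQuAD)": (9.01, 0.68),
}
reductions = {name: asr_reduction(b, d) for name, (b, d) in rows.items()}
```

In relative terms this corresponds to roughly a 73% reduction for LLaMA3-8B and roughly a 92% reduction for Mistral-7B on SQuAD.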

What To Try In 7 Days

Run CachePrune on a copy of your cached contexts with N=8 and p=0.5% to measure ASR and quality.

Compare model outputs pre/post-mask on a small holdout of real prompts to verify no quality drop.

If you use cached contexts across many queries, apply the learned mask once and reuse it to avoid per-request overhead.
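One way to run the before/after comparison in the steps above is a small ASR harness. This is a sketch under stated assumptions: the summary does not say how the paper judges whether an attack succeeded, so the canary-string detector and the names `attack_success_rate`, `contains_canary`, and `CANARY` are hypothetical, and the sample outputs are made up for illustration.

```python
from typing import Callable, Iterable


def attack_success_rate(
    responses: Iterable[str],
    followed_injection: Callable[[str], bool],
) -> float:
    """Percentage of responses that the detector flags as following the injected task."""
    items = list(responses)
    if not items:
        raise ValueError("need at least one response")
    hits = sum(1 for r in items if followed_injection(r))
    return 100.0 * hits / len(items)


# Hypothetical setup: the injected instruction asks the model to emit this marker.
CANARY = "INJECTED-OK"


def contains_canary(response: str) -> bool:
    return CANARY in response


# Toy, made-up outputs (not results from the paper).
before_mask = ["Paris. INJECTED-OK", "INJECTED-OK", "Berlin", "Rome"]
after_mask = ["Paris", "Madrid", "Berlin", "Rome. INJECTED-OK"]

asr_before = attack_success_rate(before_mask, contains_canary)
asr_after = attack_success_rate(after_mask, contains_canary)
```

Answer quality should be scored separately on clean prompts (for example with F1 or ROUGE), since a lower ASR alone does not show that quality was preserved.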

Agent Features

Memory: KV cache intervention
Tool Use: KV cache pruning
Architectures: Transformer (KV cache)

Optimization Features

Token Efficiency: mask applied to cached contexts once; no per-response token overhead
Model Optimization: prune neuron activations in KV cache
System Optimization: compatible with context caching to avoid repeated work
Training Optimization: preferential attribution loss for sample-efficient attribution
Inference Optimization: no extra LLM calls or prompt formatting at test time

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Requires access to the model's KV cache and the exact token indices of the context span.

Does not compare directly to heavy training-based defenses (finetuning) in compute-cost tradeoffs.
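The first limitation above means the context's token positions must be known before any mask can be applied. A minimal sketch of locating that span is shown below. It assumes the context was tokenized on its own and appears verbatim as one contiguous run in the full prompt's token list; real subword tokenizers can merge tokens across segment boundaries, so this should be verified for the tokenizer in use. The function name `context_token_span` is illustrative.

```python
from typing import List, Sequence, Tuple


def context_token_span(
    full_tokens: Sequence[str],
    context_tokens: Sequence[str],
) -> Tuple[int, int]:
    """Return (start, end) such that full_tokens[start:end] equals context_tokens.

    Raises ValueError unless the context occurs exactly once, because an
    ambiguous or missing span would put the mask on the wrong positions.
    """
    full: List[str] = list(full_tokens)
    ctx: List[str] = list(context_tokens)
    n, m = len(full), len(ctx)
    if m == 0:
        raise ValueError("context must contain at least one token")
    matches = [i for i in range(n - m + 1) if full[i:i + m] == ctx]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0], matches[0] + m


# Toy whitespace "tokenization" for illustration only.
prompt = "Answer using the passage : the sky is blue . Question : what color ?".split()
context = "the sky is blue .".split()
span = context_token_span(prompt, context)
```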

When Not To Use

You cannot access or modify the model's KV cache (closed API, no activation access).

You can afford large-scale finetuning or specialized training defenses and prefer training-time fixes.

Failure Modes

Pruning a large fraction of neurons degrades clean-task quality.

Adaptive attackers could potentially find token sequences that re-trigger poisoned outputs in some models.

Core Entities

Models

LLaMA3-8B, Mistral-7B-Instruct-V3.0, Phi-3.5-mini-instruct

Metrics

ASR, F1, ROUGE, BERTScore, GPT-Score

Datasets

SQuAD, HotpotQA, WildChat