47 papers found

Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

0.70
0.35
0.45
44

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.

Key finding

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)

A formal framework and first quantitative benchmark showing prompt-injection attacks are broadly effective and current defenses fall short

0.50
0.70
0.60
10

Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.

Key finding

Prompt injection attacks are broadly effective across tasks and models.

Numbers: Combined Attack ASV=0.62 and MR=0.78 averaged over 10 LLMs and 7×7 task pairs

BIPIA: a large benchmark and practical defenses for indirect prompt injection attacks on LLMs

0.70
0.60
0.50
9

External content can silently hijack LLM outputs. Measure exposure with BIPIA and add simple defenses now; full model fine-tuning yields stronger protection if you control the model.

Key finding

All evaluated LLMs show vulnerability to indirect prompt injection on BIPIA.

Numbers: Average overall ASR = 0.1179 (11.79%) on BIPIA (Table 2)

INJECAGENT: 1,054 realistic tests that measure how tool-enabled LLM agents can be hijacked by malicious content

0.60
0.60
0.70
5

Tool-enabled LLM agents can be hijacked by content they retrieve, causing unauthorized transactions or data leaks; firms must test agents with realistic IPI cases before deployment.

Key finding

INJECAGENT covers 1,054 test cases built from 17 user tools and 62 attacker instructions.

Numbers: 1,054 cases; 17 user tools; 62 attacker cases

Prompt-based attacks can make LLM agents loop or run wrong benign actions; some attacks hit >80% failure rates

0.30
0.60
0.70
3

Agents can be disabled or misused without obvious malicious text; prompt-injection can cause outages, wasted compute, or automated spamming and is hard to detect by LLM self-checks alone.

Key finding

Prompt-injection infinite-loop attacks raise failure rate substantially.

Numbers: Baseline 15.3% → Infinite loop ASR 59.4%

A small domain-specific language (SPML) that compiles strict chatbot specs and blocks prompt-injection attacks before they hit the LLM

0.60
0.70
0.60
2

SPML provides a lightweight, rule-like front door that blocks many prompt-injection attacks before they reach costly LLM calls, reducing risk and operating cost for deployed chatbots.

Key finding

SPML misses fewer jailbreak attacks (lower error rate, ER) than GPT-4 on the paper's benchmark.

Numbers: Jailbreak ER: SPML 1.29% vs GPT-4 4.31% (Table 2)

JudgeDeceiver: automatically craft prompts that reliably trick LLM-as-a-Judge to pick an attacker’s response

0.20
0.70
0.60
2

If your product uses LLMs to rank or judge content, attackers can mass-produce short token suffixes that make the judge pick malicious or low-quality content. This can poison leaderboards, search results, automated labels for training, or tool selection.

Key finding

JudgeDeceiver yields high attack success rates against open-source judges.

Numbers: ASR = 90.8% (Mistral-7B, MT-Bench average)

SafeRAG: first Chinese benchmark showing subtle data-injection attacks that bypass retrievers, filters, and generators

0.60
0.60
0.45
1

RAG pipelines used in products can be quietly manipulated by injected texts that bypass retrievers, filters, or LLMs; this risks wrong answers, hidden ads, or unwarranted refusals—test and harden the whole pipeline, not only the model.

Key finding

RAG systems are vulnerable to subtle injection attacks (noise, conflict, toxicity, DoS) at multiple pipeline stages.

Numbers: evaluated 14 RAG components; attacks reduce F1(avg) and AFR across tasks

Small poisoned prompts can make LLMs output attacker-chosen tokens while keeping accuracy nearly intact

0.30
0.60
0.60
1

Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.

Key finding

Attack success rate (ASR) is very high for poisoned prompts.

Numbers: ASR often 95–100% across datasets and models (Table 1).

Make tool-using LLM agents provably safe by combining safety engineering, info-flow labels, and MCP extensions

0.30
0.60
0.50
0

Deterministic guardrails reduce unacceptable risks in enterprise agents (data leaks, unauthorized writes) and let teams choose autonomy levels with verifiable safety.

Key finding

A bounded Alloy model can prove that label-based policies eliminate unsafe flows that otherwise occur.
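
The label-based policy idea can be illustrated with a toy check. This is a minimal sketch, not the paper's Alloy model: the label names, their ordering, and the helpers `flow_allowed` and `join` are all hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Label(IntEnum):
    # Hypothetical sensitivity levels, ordered from least to most sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


def flow_allowed(data_label: Label, sink_clearance: Label) -> bool:
    """Allow a flow only if the sink is cleared for data this sensitive."""
    return data_label <= sink_clearance


def join(*labels: Label) -> Label:
    """Derived data carries the most restrictive label among its inputs."""
    return max(labels)


# A tool result that mixes a public web page with a confidential record.
derived = join(Label.PUBLIC, Label.CONFIDENTIAL)
can_email_externally = flow_allowed(derived, Label.PUBLIC)
can_write_internal_db = flow_allowed(derived, Label.CONFIDENTIAL)
```

Because the check is deterministic, an injected instruction cannot talk its way past it: the derived value stays confidential whatever the retrieved text says.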

MI9: a runtime governance layer that monitors and intervenes in agentic AI behavior

0.60
0.55
0.60
0

MI9 turns opaque agent decisions into actionable runtime controls, reducing undetected risky behaviors while keeping false alarms low, which helps prevent costly operational and compliance incidents.

Key finding

MI9 detects nearly all simulated governance violations on evaluated traces.

Numbers: Detection Rate 99.81% (MI9) vs 93.98% (OT) vs 68.52% (LS)

A practical black-box method that forces poisoned documents into retrieval and hijacks RAG and agentic systems

0.60
0.60
0.70
0

If your product uses embedding-based retrieval and allows external or user-supplied documents, an attacker can cheaply force a poisoned document into search results and trigger downstream harms (phishing, data exfiltration, tool misuse). Protect retrieval and write access, not just the model.

Key finding

A short, optimized trigger reliably surfaces a single poisoned document into top-K retrieval.

Numbers: Recall@5 ≈ 95% average across 11 BEIR datasets at n=10 tokens
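
For readers unfamiliar with the metric, Recall@K here can be read as the fraction of queries for which the poisoned document appears among the top K retrieved results. The sketch below assumes that reading; the function name `recall_at_k` and the toy rankings are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def recall_at_k(rankings, target_id, k=5):
    """Fraction of queries whose top-k results contain target_id."""
    if not rankings:
        return 0.0
    hits = sum(1 for ranked in rankings if target_id in ranked[:k])
    return hits / len(rankings)


# Toy data: one ranked list of document ids per query.
rankings = [
    ["poison", "d1", "d2", "d3", "d4", "d5"],
    ["d1", "d2", "poison", "d3", "d4", "d5"],
    ["d1", "d2", "d3", "d4", "d5", "poison"],  # rank 6, outside the top 5
    ["d1", "poison", "d2", "d3", "d4", "d5"],
]
score = recall_at_k(rankings, "poison", k=5)
```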

An agent that reconstructs hidden GraphRAG knowledge graphs with few queries

0.40
0.60
0.60
0

Graph-structured retrieval can leak reusable entity-relation graphs with surprisingly few queries; operators should treat structured retrieval as a privacy risk and add monitoring, response filtering, or query limits.

Key finding

AGEA recovers a very large fraction of nodes and edges under 1,000 queries on medium graphs.

Numbers: M-GraphRAG Medical: nodes 87.09%, edges 80.16% at T=1000

A provenance-aware, multi-agent pipeline that sanitizes text and images and validates LLM outputs to stop prompt-injection across LangChain/

0.60
0.60
0.50
0

Agentic systems that chain LLMs and tools can be hijacked by hidden instructions in text or images. Adding per-message sanitization, provenance tracking, and output validation reduces attack surface without harming legitimate task accuracy—important for customer-facing automation, finance, and security-sensitive tools.

Key finding

Multimodal prompt-injection detection rate improved to 94%.

Numbers: 94% detection (paper, Section V.A)

Build a tool-plan (TDG) and block unexpected tool calls to stop indirect prompt injections

0.60
0.65
0.40
0

If your agent can call external services, hidden instructions in returned content can trigger harmful actions. IPIGUARD stops many such attacks by pre-planning allowed tool calls and blocking unapproved ones, trading modest extra cost for much stronger protection.

Key finding

IPIGUARD reduces the average targeted attack success rate (ASR) to below 1% on AgentDojo.

Numbers: ASR ≈ 0.69% average on AgentDojo (Table 1)
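
The pre-plan-then-block idea can be sketched as a simple guard. This is an illustrative simplification, not IPIGUARD's implementation: the paper builds a tool dependency graph, whereas this sketch flattens the plan into a list of approved tool names. The class `ToolPlanGuard` and the tool names are hypothetical.

```python
class ToolPlanGuard:
    """Permit only tool calls planned before untrusted content was read."""

    def __init__(self, planned_calls):
        # planned_calls: tool names approved up front (hypothetical format).
        self.remaining = list(planned_calls)
        self.blocked = []

    def authorize(self, tool_name: str) -> bool:
        if tool_name in self.remaining:
            self.remaining.remove(tool_name)  # each planned call may run once
            return True
        self.blocked.append(tool_name)  # record the attempt for auditing
        return False


guard = ToolPlanGuard(["search_flights", "read_email"])
ok_first = guard.authorize("read_email")      # planned, so allowed
ok_injected = guard.authorize("send_money")   # never planned, so blocked
ok_repeat = guard.authorize("read_email")     # already used, so blocked
```

The design choice is that the plan is fixed before any external content is read, so instructions hidden in tool output cannot add new calls.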

Seven concrete security gaps that break current defenses in cross-domain multi‑agent LLMs

1.00
0.70
0.80
0

Cross‑organization agent cooperation breaks single‑domain safety and audit assumptions, increasing legal, financial, and operational risk unless systems are instrumented with cross‑domain security metrics.

Key finding

Seven distinct categories of security risk appear when LLM agents cross ownership boundaries.

Numbers: 7 challenge categories (C1–C7)

Straightforward prompt injections can make tool-using LLM agents leak user data seen during a task

0.40
0.60
0.60
0

If you let LLM agents access user data, simple injected text can cause measurable leaks; test agents on task-specific injection scenarios before deployment.

Key finding

Average attack success rate (ASR) across models and tasks is around 15–20%.

Numbers: ASR ≈15% (48 tasks) and ≈20% (16 tasks); Llama-4 (17B) hit 40% on 16 tasks.

Sentinel — a ModernBERT detector that flags prompt injections with ~98% F1 on internal tests

0.80
0.50
0.60
0

Sentinel materially reduces successful prompt injections on evaluated benchmarks, enabling safer prompt-driven products and lowering the risk of harmful or leaking responses.

Key finding

High internal detection accuracy and F1.

Numbers: AvgAcc 0.987, F1 0.980 on internal held-out test

MARAGE: optimize a short adversarial suffix that makes RAG systems regurgitate retrieved private data across unseen models

0.45
0.60
0.50
0

Public RAG (search+LLM) services can be probed to leak exact retrieved passages. Simple prompt rules are insufficient. Companies must test extraction attacks, add model-level defenses, and monitor outputs for leaked context.

Key finding

MARAGE achieves much higher exact-match extraction than manual or prior optimized attacks on diverse RAG data.

Numbers: EM up to 0.796 vs manual 0.082 on LLaMA3 (Rag-12000); 12/20 entries EM>0.8

Use multi-agent LLM teams to automatically probe and measure prompt leakage

0.40
0.60
0.50
0

Prompt leakage can expose business rules and secrets. Measuring leakage with an 'advantage' score helps prioritize defenses and assess whether prompt hardening or guard LLMs are needed.

Key finding

Low-security models leak prompts often.

Numbers: Advantage = 0.65 (Section V)

Prompt injections can flip automated LLM judges—attacks succeed up to ~74% and committees fix much of it

0.50
0.60
0.60
0

Automated LLM judges can be manipulated by prompt injections, risking wrong evaluations; use committees and layered defenses for high-stakes scoring.

Key finding

Adaptive Search-Based Attack (ASA) is the most effective attack across models.

Numbers: ASR 42.9–73.8% (Table I); avg 56.2% (Table VIII)
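
A committee of judges can be approximated by a strict-majority vote over independent verdicts. The sketch below assumes that aggregation rule; the paper's actual committee mechanism may differ, and `committee_verdict` is a hypothetical helper.

```python
from collections import Counter


def committee_verdict(votes):
    """Return the strict-majority label, or None when no label has one."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Three judges; one has been flipped by an injected instruction.
verdict = committee_verdict(["A", "B", "A"])
# Two judges that disagree yield no majority, which can trigger escalation.
tie = committee_verdict(["A", "B"])
```

Returning `None` on a split vote lets a high-stakes pipeline route the item to a human or an additional judge instead of accepting a possibly manipulated verdict.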

OET: a modular toolkit that generates optimization-based adversarial prompts and benchmarks defenses

0.50
0.50
0.60
0

Adaptive, optimization-driven prompt injections can bypass some defenses and expose sensitive outputs, so firms must test deployed LLMs (especially open-source ones) with rigorous red-teaming before production.

Key finding

Open-source models are substantially easier to coerce than the closed-source models tested.

Numbers: Qwen2-7B-Instruct ASR 0.93–0.99 across tasks; GPT-4o-mini ASR 0.01–0.03

A broad benchmark shows RAG systems remain vulnerable to data poisoning and current defenses only partially help

0.30
0.60
0.50
0

If your product augments an LLM with an open or large text store, attackers who can add or edit that store can steer answers or cause refusals; naive defenses leave gaps and some robust fixes reduce product quality.

Key finding

Most poisoning attacks work well on original QA datasets.

Numbers: Example: BPI ASR = 0.94 on NQ (Table 2)

Prune KV-cache neurons to stop indirect prompt-injection without extra LLM calls

0.60
0.55
0.75
0

CachePrune reduces indirect prompt-injection risk with minimal compute and no change to prompts or extra LLM calls, protecting production LLM apps while keeping answer quality.

Key finding

CachePrune cuts attack success on LLaMA3-8B (SQuAD) from ~27.86% to ~7.44%.

Numbers: 27.86% → 7.44% (Table 1, SQuAD LLaMA3-8B)