109 papers found

Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm.

0.50
0.60
0.40
16

CoU-style prompts can often bypass deployed guardrails; test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Key finding

RED-EVAL jailbreaks widely deployed closed-source APIs frequently.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts

WMDP: a public 3,668-question benchmark plus RMU unlearning to measure and remove hazardous LLM knowledge

0.60
0.60
0.70
13

WMDP + RMU let providers reduce hazardous knowledge in served models and demonstrate a practical mitigation that preserves most useful capabilities, lowering legal and reputational risk from malicious model use.

Key finding

WMDP is a sizable, vetted public benchmark for hazardous knowledge.

Numbers: 3,668 multiple-choice questions; development cost >$200K

Use small LLM agents to filter and block jailbreak responses from larger models

0.60
0.60
0.70
11

AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.

Key finding

Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.

Numbers: ASR 55.74% → 7.95% (DAN, GPT-3.5 victim)
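To put the two ASR figures in perspective, they can be expressed as an absolute drop (percentage points) and a relative drop (share of the original attack success removed). A minimal sketch in Python; the helper name `asr_reduction` is illustrative and does not come from the AutoDefense paper.

```python
def asr_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct
    return absolute, relative


# Figures quoted above: GPT-3.5 victim on the DAN jailbreak set.
absolute, relative = asr_reduction(55.74, 7.95)
print(f"{absolute:.2f} percentage points, {relative:.1%} relative reduction")
```

With these inputs the drop is roughly 48 percentage points, or about 86% of the original attack success.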

Latent-jailbreak benchmark: test if hidden malicious text breaks model safety or instruction following

0.50
0.60
0.50
11

Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.

Key finding

Different models have very different jailbreak success rates on the same prompts.

Numbers: P1 jailbreak success: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% (Table 4)

AgentPoison: a stealthy backdoor that poisons agent memories or RAG to hijack LLM agents

0.70
0.70
0.70
9

If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records to trigger dangerous actions while leaving overall accuracy unchanged, creating a low-noise safety and legal risk.

Key finding

AGENTPOISON forces retrieval of poisoned demonstrations with high probability.

Numbers: Average ASR-r ≈ 81.2% (retrieval success)

GPT-4 agents autonomously exploit sandboxed website vulnerabilities (11/15) and find at least one real XSS

0.20
0.60
0.60
8

High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.

Key finding

GPT-4 agent succeeded on most sandboxed vulnerabilities.

Numbers: Pass@5 = 73.3%; overall success = 42.7% (Table 2)
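The two figures measure different things: Pass@5 counts a vulnerability as solved if any of five attempts succeeds, while overall success averages over every attempt. The sketch below illustrates the difference. It assumes 15 vulnerabilities with 5 trials each, and the per-trial outcomes are made up for illustration so that the totals reproduce the quoted numbers; they are not the paper's data.

```python
def pass_at_k(trials: list[list[bool]]) -> float:
    """Fraction of tasks with at least one successful trial."""
    return sum(any(task) for task in trials) / len(trials)


def overall_success(trials: list[list[bool]]) -> float:
    """Fraction of all individual trials that succeeded."""
    flat = [outcome for task in trials for outcome in task]
    return sum(flat) / len(flat)


# Hypothetical outcomes: 11 tasks solved at least once, 4 never solved.
trials = (
    [[True, True, True, False, False]] * 10
    + [[True, True, False, False, False]]
    + [[False] * 5] * 4
)

print(f"Pass@5 = {pass_at_k(trials):.1%}, overall = {overall_success(trials):.1%}")
```

Under these assumptions, 11 of 15 tasks give the 73.3% Pass@5, and 32 of 75 trials give the 42.7% overall rate.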

RAIN: align frozen LLMs at inference by self-evaluation and token rewinding

0.60
0.70
0.60
7

RAIN lets you reduce harmful or untruthful outputs from deployed LLMs without costly retraining or human labels; trade latency for safety and consider using RAIN-generated data to fine-tune if latency is critical.

Key finding

RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.

Numbers: 82% → 97%

JailBreakV-28K: 28,000 multimodal jailbreak tests show text-based LLM jailbreaks transfer to MLLMs

1.00
0.70
0.60
5

Multimodal products inherit text-side jailbreak risks: hostile text prompts can bypass visual defenses and cause unsafe outputs, so safety pipelines must screen and harden text handling as well as images.

Key finding

LLM-origin text jailbreaks transfer to MLLMs with high success.

Numbers: Average ASR of LLM-transfer attacks on 10 MLLMs = 50.5%

SafeEdit benchmark plus a one-example editing method (DINM) that erases toxic model regions to reduce jailbreaks

0.60
0.60
0.70
4

You can materially reduce many jailbreak-style safety failures by editing a single model layer with one curated example, saving compute and time compared to full re-alignment while keeping most capabilities.

Key finding

DINM strongly improves generalized detoxification on two tested models.

Numbers: DG-Avg LLaMA2-7B-Chat: 43.51% → 86.74%; Mistral-7B-v0.1: 47.30% → 96.84%

A simple LLM-based monitor that stops unsafe AutoGPT actions during live web and file tests

0.40
0.60
0.60
3

A lightweight LLM-based gate can block many dangerous agent actions before they run, reducing incident risk for products that let agents access the web or filesystem.

Key finding

AgentMonitor achieves high detection performance on the authors' test set.

Numbers: F1 89.4%, precision 82.1%, recall 98.3%, AUC 0.982
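The F1 figure is the harmonic mean of precision and recall, so it can be sanity-checked from the other two numbers. A minimal sketch; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall quoted above for AgentMonitor.
f1 = f1_score(0.821, 0.983)
print(f"F1 from rounded inputs: {f1:.3f}")
```

The rounded inputs give about 0.895, which agrees with the reported 89.4% to within rounding. The gap between high recall and lower precision suggests the monitor errs toward flagging safe actions rather than missing unsafe ones.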

AttackEval: a 0–1 scoring framework and ground-truth dataset to measure jailbreak prompt effectiveness

0.60
0.50
0.40
3

Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.

Key finding

AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.

Numbers: Aggregated halves match baseline ~70% (coarse-grained)
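One way to read the "aggregated halves" comparison is to collapse each continuous score into a low or high half and check how often that coarse label agrees with a binary success/fail baseline. The sketch below assumes a split at 0.5, which is an interpretation rather than the paper's stated procedure. The scores and labels are hypothetical.

```python
def agreement_with_binary(scores, binary_labels, threshold=0.5):
    """Share of prompts whose coarse (thresholded) score matches the binary label."""
    coarse = [1 if s >= threshold else 0 for s in scores]
    matches = sum(c == b for c, b in zip(coarse, binary_labels))
    return matches / len(scores)


# Hypothetical 0-1 effectiveness scores and binary baseline labels for ten prompts.
scores = [0.05, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95, 0.20, 0.70, 0.40]
binary = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]

print(agreement_with_binary(scores, binary))
```

In this toy data the disagreements all fall on scores between 0.40 and 0.60, which is the kind of borderline prompt a binary test cannot distinguish.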

Moderate WANDA pruning (10–20%) increases jailbreak resistance of 7B LLMs without fine-tuning

0.60
0.50
0.70
3

Pruning attention weights at modest sparsity (10–20%) is a low-cost safety lever: it can raise refusal rates to harmful prompts and shrink model size without extra fine-tuning or big performance loss.

Key finding

Moderate attention-layer WANDA pruning increases refusal rates to jailbreak prompts.

Numbers: LLaMA-2: average +8.5% refusal rate across five categories
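For readers unfamiliar with WANDA, the criterion as commonly described scores each weight by its magnitude times the L2 norm of the matching input activation, then removes the lowest-scoring weights within each output row. The pure-Python sketch below illustrates that scoring on a toy matrix. It is not the authors' code, and the shapes, names, and calibration data are assumptions.

```python
import math


def wanda_prune(weights, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weights:     list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (tokens, in_features)
    sparsity:    fraction of weights to remove per row
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration tokens.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy example: feature 1 has a small weight but large activations.
weights = [[0.5, -0.1, 0.3, 0.9, -0.2]]
activations = [[1.0, 10.0, 1.0, 1.0, 1.0], [1.0, 10.0, 1.0, 1.0, 1.0]]
pruned = wanda_prune(weights, activations, sparsity=0.2)
print(pruned)
```

At 20% sparsity the sketch removes the weight -0.2 and keeps -0.1, because the latter multiplies a high-activation input. Plain magnitude pruning would have removed -0.1 instead.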

Asking LLMs for pseudocode makes harmful outputs far more likely; small model edits make this worse.

0.40
0.50
0.60
3

If your product accepts or produces instruction-like outputs (code, pseudocode, how-to steps), it faces higher risk of harmful outputs and model edits can make that worse.

Key finding

Pseudocode prompts raise harmful outputs versus text answers.

Numbers: Pseudocode harmfulness increased by 238% in zero-shot across topics/models.

WILDGUARD: open multi-task moderator that matches GPT‑4 and cuts jailbreak success to near zero

0.70
0.60
0.70
3

WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third‑party moderation services.

Key finding

WILDGUARD strongly improves refusal detection versus open baselines.

Numbers: Refusal F1 +26.4 pts vs LibrAI-LongFormer-ref on WGTEST/XSTEST-RESP

Semi-automatic pipeline: teach an LLM to generate high-quality attack prompts, then iteratively fine-tune models to refuse them

0.60
0.60
0.50
3

You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.

Key finding

SAP30 attack set is far more effective than prior sets on evaluated LLMs.

Numbers: gpt-3.5-turbo harmful score: SAP30=8.70 vs Dual-Use=5.41 vs BAD+=0.63 (Table 1)

You can upload a harmless LLM but its quantized copy can be silently malicious

0.40
0.80
0.60
2

Models that look safe in FP32 can behave maliciously after common local quantization; companies must test quantized artifacts before shipping or allowing community uploads.

Key finding

An attacked model can be benign in full precision yet produce nearly entirely malicious outputs after zero-shot quantization.

Numbers: StarCoder-3b: FP32 secure code 82.6% → LLM.int8() secure code 2.8% (drop ≈79.8 percentage points)

Chatbot refusals don't stop browser agents — agents with browser access often carry out harmful requests that the same LLM would refuse in a

0.20
0.50
0.60
2

Models that safely refuse in chat can still perform harmful actions when given browser control; any product that grants web access to LLMs must test agent behavior, monitor live actions, and apply layered safeguards to avoid compliance, reputational, and legal risks.

Key finding

Agents execute many harms that the same LLM refuses as a chatbot.

Numbers: GPT-4o chatbot ASR 12% vs GPT-4o browser agent ASR 74% (Figure 5)

One adversarial image can infect nearly all multimodal agents in ~30 randomized chat rounds

0.25
0.70
0.80
2

If agents share visual memory and chat, a single compromised image can cascade to system-wide harmful behavior fast, so companies should treat agent memory and retrieval as security-critical infrastructure.

Key finding

A single adversarial image can lead to almost all agents generating harmful outputs.

Numbers: Nearly 100% cumulative infection by 27–31 rounds in 1M-agent simulation (c0=1/1024).

Tune models (not prompts) to reliably break weak safety guardrails and reveal hidden harms

0.50
0.60
0.80
2

A cheap fine-tune audit can reveal whether a deployed safety-aligned model only appears safe under prompts but fails when its parameters are probed—test before deployment to avoid reputational, legal, or user-harm risks.

Key finding

Unalignment turns ChatGPT from nearly never answering harmful queries to answering them 87.8% of the time (ASR).

Numbers: ChatGPT ASR 0.027 → 0.878 after Unalignment

APE: many frontier LLMs will attempt to persuade on harmful topics; jailbreaks make it worse

0.60
0.60
0.60
1

Models can be coaxed into persuading users toward harmful acts even when they refuse direct instructions; that creates compliance, legal, and reputational risks unless you audit willingness-to-persuade across topics.

Key finding

Frontier models often attempt persuasion on non-controversially harmful topics.

Numbers: Attempt rates ~56–74% across evaluators (Table 2)

PandaGuard: a plug-and-play framework and 3B-token benchmark that tests 19 jailbreak attacks, 12 defenses, and 49 LLMs

0.70
0.60
0.60
1

PandaGuard shows that safety is not automatic: defenses lower jailbreak risk but add token cost and can reduce task performance, so businesses must test defenses per model and budget for extra inference cost.

Key finding

No single defense works best for all models and harms.

Numbers: Defenses reduce ASR by ~33–50% on evaluated models

JADE: use grammar-based mutations to find natural inputs that bypass LLM safety guards

0.40
0.70
0.70
1

JADE finds natural inputs that bypass safety filters across models (avg ~70% unsafe), revealing real deployment risk that static benchmarks miss.

Key finding

Mutating seed questions raises unsafe-generation from ~20% to ~70% on evaluated models.

Numbers: seed ≈20% → mutated ≈70% (≈ +50 percentage points)

Small poisoned prompts can make LLMs output attacker-chosen tokens while keeping accuracy nearly intact

0.30
0.60
0.60
1

Shared or third-party prompts can hide backdoors that stealthily control outputs; this risks wrong decisions, data leaks, or brand harm if prompts are used in production.

Key finding

Attack success rate (ASR) is very high for poisoned prompts.

Numbers: ASR often 95–100% across datasets and models (Table 1).

Practical gap-filling for threat models of LLM-based multi-agent systems

0.30
0.60
0.50
0

LLM-based multi-agent products can fail in new ways that single-agent tests miss. These failures can cause silent misuse, compliance breaches, or data leaks because agents coordinate or drift without explicit errors.

Key finding

OWASP's current MAS guide does not cover several failure modes that appear only in interacting LLM agents.