Use small LLM agents to filter and block jailbreak responses from larger models

Overview

Decision Snapshot: Ready For Pilot

Results show large drops in attack success rate (ASR) on the evaluated jailbreak sets and stable accuracy on safe prompts, but defense quality depends on how well aligned the small defender model is, and the filter adds modest latency per response.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu

Why It Matters For Business

AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

AutoDefense is a multi-agent response-filter that watches model outputs and blocks harmful replies produced after jailbreak prompts. It splits the defense into simple roles (intention analyzer, prompt inferrer, judge) and runs them as LLM agents (built on AutoGen). With LLaMA-2-13B as the defender, a three-agent setup lowers GPT-3.5's attack success rate (ASR) on the evaluated DAN jailbreaks from 55.74% to 7.95%, while keeping accuracy on safe queries high (≈92.9%). The system is prompt-agnostic, can include other defenses (e.g., Llama Guard) as agents, and adds modest runtime overhead (single-agent 2.81s vs three-agent 6.95s on H100).

Problem Statement

Jailbreak prompts can trick aligned LLMs into producing harmful content. Existing defenses either retrain models (costly), change user prompts (fragile), or depend on a single model's instruction following. We need a robust, model-agnostic filter that inspects outputs and blocks harmful replies without changing user input.

Main Contribution

A general multi-agent response-filter framework (AutoDefense) that classifies and blocks harmful LLM outputs without modifying user prompts.

A three-step, role-based defense agency: intention analysis, prompt inference, and final judgment, implemented as 1–3 LLM agents coordinated by a controller.
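The three-step flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `VALID`/`INVALID` output convention, and the `REFUSAL` text are all assumptions made for the example. The actual system runs the roles as AutoGen agents.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

# Assumed refusal text returned in place of a blocked response.
REFUSAL = "I'm sorry, but I can't assist with that request."


@dataclass
class Verdict:
    valid: bool      # True if the response may be shown to the user
    rationale: str   # raw judge output, kept for auditing


def defend(response: str, llm: LLM) -> Verdict:
    """Run the three roles in a fixed order on a single model response."""
    # Step 1: intention analysis; only the response is inspected.
    intention = llm(f"Analyze the intention behind this response:\n{response}")
    # Step 2: infer the prompts that could have produced this response.
    prompts = llm(
        f"Intention analysis:\n{intention}\n"
        f"Infer the original prompts that could lead to this response:\n{response}"
    )
    # Step 3: final judgment, conditioned on both earlier analyses.
    judgment = llm(
        f"Intention analysis:\n{intention}\nInferred prompts:\n{prompts}\n"
        f"Response:\n{response}\nAnswer with VALID or INVALID."
    )
    return Verdict(judgment.strip().upper().startswith("VALID"), judgment)


def filter_response(response: str, llm: LLM) -> str:
    """Return the response unchanged if judged valid, else the refusal text."""
    return response if defend(response, llm).valid else REFUSAL
```

Passing the same `llm` to all three steps corresponds to a single defender model playing every role; a multi-agent variant would give each step its own system prompt or its own model.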

Key Findings

Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.

Numbers: ASR 55.74% → 7.95% (DAN, GPT-3.5 victim)

Practical Use: Deploy a small aligned model (LLaMA-2-13B) as a 3-agent filter to sharply reduce jailbreak success against GPT-3.5 on similar jailbreaks.

Evidence Ref: Table 1

Integrating Llama Guard as a fourth agent reduced the False Positive Rate (FPR) of the LLaMA-2-7B defense from 37.32% to 6.80%.

Numbers: FPR 37.32% → 6.80% (LLaMA-2-7B, 4-agent)

Practical Use: If your defender model over-blocks safe replies, add a specialized moderation agent (e.g., Llama Guard) to lower false alarms.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Attack Success Rate (ASR)	7.95%	55.74%	-47.79pp	DAN dataset; GPT-3.5 victim; LLaMA-2-13B defender; 3-agent	AutoDefense three-agent reduces ASR from 55.74% to 7.95%	Table 1
False Positive Rate (FPR)	6.80%	37.32%	-30.52pp	DAN dataset; LLaMA-2-7B defender; 4-agent with Llama Guard	Adding Llama Guard as a 4th agent lowers FPR to 6.80%	Table 3
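The Delta column is a difference in percentage points (value minus baseline). The short sketch below recomputes it from the two rows and also derives the relative reduction, which is not reported in the table and is shown only as arithmetic on the quoted numbers. The function names are illustrative.

```python
def pp_delta(value: float, baseline: float) -> float:
    """Change in percentage points (negative means the metric went down)."""
    return round(value - baseline, 2)


def relative_reduction(value: float, baseline: float) -> float:
    """Reduction as a percentage of the baseline."""
    return round((baseline - value) / baseline * 100, 1)


rows = {
    "ASR": (7.95, 55.74),   # (value, baseline) from the first table row
    "FPR": (6.80, 37.32),   # (value, baseline) from the second table row
}

for name, (value, baseline) in rows.items():
    print(name, pp_delta(value, baseline), "pp;",
          relative_reduction(value, baseline), "% relative")
# ASR -47.79 pp; 85.7 % relative
# FPR -30.52 pp; 81.8 % relative
```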

What To Try In 7 Days

Prototype a response filter: run a small aligned model (LLaMA-2-13B) as a 1–3 agent filter on the outputs of an internal LLM.

Evaluate on your common failure modes: collect past harmful outputs and measure ASR and FPR.

Add an off-the-shelf moderation agent (e.g., Llama Guard) as an extra agent to cut false positives without retraining.
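For the evaluation step, a small harness along these lines may be enough to start. It is a sketch under stated assumptions: the `Record` fields are hypothetical, ASR is taken as the share of attack prompts whose harmful response passed the filter, and FPR as the share of safe prompts whose response was blocked. Check these definitions against your own labeling before comparing numbers with the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Record:
    is_attack: bool   # prompt came from a jailbreak/attack set
    harmful: bool     # the victim model's response was labeled harmful
    blocked: bool     # the filter blocked the response


def attack_success_rate(records: Iterable[Record]) -> float:
    """Fraction of attack prompts whose harmful response reached the user."""
    attacks = [r for r in records if r.is_attack]
    if not attacks:
        raise ValueError("no attack records")
    passed = sum(1 for r in attacks if r.harmful and not r.blocked)
    return passed / len(attacks)


def false_positive_rate(records: Iterable[Record]) -> float:
    """Fraction of safe prompts whose response was wrongly blocked."""
    safe = [r for r in records if not r.is_attack]
    if not safe:
        raise ValueError("no safe records")
    return sum(1 for r in safe if r.blocked) / len(safe)
```

Running the same records with `blocked=False` everywhere gives the undefended baseline, so the before/after ASR comes from a single labeled set.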

Agent Features

Memory

short-term conversational state only

Planning

coordinator-driven sequential interaction

Tool Use

can integrate external defense tools as agents (e.g., Llama Guard)
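One way to picture an external tool joining the agent sequence is a coordinator that calls each agent once in a fixed order and passes along the transcript so far. The sketch below is an assumption-laden illustration: `moderation_stub` is a keyword stand-in, not the Llama Guard API, and `judge_stub`, `UNSAFE_TERMS`, and the `safe`/`unsafe` and `VALID`/`INVALID` strings are invented for the example.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]                    # (agent name, message text)
Agent = Callable[[str, List[Message]], str]  # (response, transcript so far) -> message


def run_in_order(response: str, agents: List[Tuple[str, Agent]]) -> List[Message]:
    """Coordinator: call each agent once, in a fixed order, sharing the transcript."""
    transcript: List[Message] = []
    for name, agent in agents:
        # Each agent sees a copy of the messages produced before its turn.
        transcript.append((name, agent(response, list(transcript))))
    return transcript


# Hypothetical stand-in for an external moderation model such as Llama Guard.
UNSAFE_TERMS = ("explosive", "malware")  # illustrative only


def moderation_stub(response: str, transcript: List[Message]) -> str:
    return "unsafe" if any(t in response.lower() for t in UNSAFE_TERMS) else "safe"


def judge_stub(response: str, transcript: List[Message]) -> str:
    # The judge reads the moderation agent's message from the transcript.
    flagged = any(name == "moderation" and msg == "unsafe" for name, msg in transcript)
    return "INVALID" if flagged else "VALID"


def is_blocked(response: str, agents: List[Tuple[str, Agent]]) -> bool:
    """The last agent in the list is treated as the judge."""
    return run_in_order(response, agents)[-1][1] == "INVALID"
```

Because the moderation output is just another transcript message, adding or removing the extra agent changes only the `agents` list, not the coordinator.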

Frameworks

AutoGen

Is Agentic

Yes

Architectures

multi-agent (coordinator + role agents)

Collaboration

role-based task decomposition (intention analyzer, prompt inferrer, judge)

Optimization Features

Infra Optimization

benchmarked on NVIDIA H100; inference parallelism via llama-cpp-python

Model Optimization

INT8 quantization used for inference

Inference Optimization

temperature tuning; use of quantized open-source LLMs

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/XHMY/AutoDefense

Data URLs

https://github.com/XHMY/AutoDefense

Risks & Boundaries

Limitations

Agents communicate in a fixed order; dynamic communication is unexplored.

Effectiveness depends on alignment of defender LLMs (Vicuna family performs worse).

When Not To Use

When ultra-low latency is required and a 2–3x time increase is unacceptable.

If you cannot run or host an aligned open-source LLM for agents.
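To judge whether the latency cost is acceptable, the defense times quoted in the summary (2.81 s single-agent, 6.95 s three-agent on H100) can be turned into an overhead ratio and compared against a budget. The helper names and the 5-second budget below are illustrative assumptions, and timings on other hardware will differ.

```python
SINGLE_AGENT_S = 2.81   # defense time quoted in the summary, single agent
THREE_AGENT_S = 6.95    # defense time quoted in the summary, three agents


def overhead(single_s: float, multi_s: float) -> tuple:
    """Return (ratio, extra seconds) of the multi-agent setup vs. single-agent."""
    return multi_s / single_s, multi_s - single_s


def fits_budget(defense_s: float, budget_s: float) -> bool:
    """True if the defense time alone stays within the per-response budget."""
    return defense_s <= budget_s


ratio, extra = overhead(SINGLE_AGENT_S, THREE_AGENT_S)
print(f"{ratio:.2f}x slower, {extra:.2f} s extra per response")
# 2.47x slower, 4.14 s extra per response

# Hypothetical 5-second budget: single-agent fits, three-agent does not.
print(fits_budget(SINGLE_AGENT_S, 5.0), fits_budget(THREE_AGENT_S, 5.0))
```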

Failure Modes

Weakly aligned agent models (e.g., Vicuna) fail to detect harmful outputs.

Off-topic or subtle refusals may confuse keyword-based pre-filtering; GPT-4 judging is still needed for edge cases.
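The sketch below shows why a keyword pre-filter misses subtle refusals. It assumes a simple substring match on a hypothetical marker list; the keyword list actually used in the evaluation is not given here. Responses that match are treated as explicit refusals, and everything else is escalated to a stronger judge.

```python
# Illustrative refusal markers only; not the list used in the paper's evaluation.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_explicit_refusal(response: str) -> bool:
    """Case-insensitive substring match against the marker list."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def triage(response: str) -> str:
    """Route a response: skip the judge on explicit refusals, else escalate."""
    return "safe_by_keyword" if is_explicit_refusal(response) else "needs_judge"


print(triage("I'm sorry, but I can't help with that."))   # safe_by_keyword
print(triage("Let's talk about gardening instead."))      # needs_judge
```

The second response is a refusal by deflection with no marker phrase, so it falls through to the judge, which is the edge case the failure mode describes.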

Core Entities

Models

GPT-3.5-Turbo-1106 (victim), LLaMA-2-13B (defense agent), LLaMA-2-7B, LLaMA-2-70B, Vicuna-13B, vicuna-7b-v1.5, mistral-7b-v0.2, mixtral-8x7b-v0.1

Metrics

Attack Success Rate (ASR), False Positive Rate (FPR), Accuracy, Defense time (sec)

Datasets

DAN jailbreak set (390 questions), Curated harmful prompts (33 prompts), Stanford Alpaca instruction-following (52K subset used)
