Overview
Results show large ASR drops on evaluated jailbreak sets and stable accuracy on safe prompts, but defense quality depends on the alignment level of the small defender model and adds modest latency per response.
Citations: 11
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.
Summary TLDR
AutoDefense is a multi-agent response-filter that watches model outputs and blocks harmful replies produced after jailbreak prompts. It splits the defense into simple roles (intention analyzer, prompt inferrer, judge) and runs them as LLM agents (built on AutoGen). With LLaMA-2-13B as the defender, a three-agent setup lowers GPT-3.5's attack success rate (ASR) on the evaluated DAN jailbreaks from 55.74% to 7.95%, while keeping accuracy on safe queries high (≈92.9%). The system is prompt-agnostic, can include other defenses (e.g., Llama Guard) as agents, and adds modest runtime overhead (single-agent 2.81s vs three-agent 6.95s on H100).
Problem Statement
Jailbreak prompts can trick aligned LLMs into producing harmful content. Existing defenses either retrain models (costly), change user prompts (fragile), or depend on a single model's instruction following. We need a robust, model-agnostic filter that inspects outputs and blocks harmful replies without changing user input.
Main Contribution
A general multi-agent response-filter framework (AutoDefense) that classifies and blocks harmful LLM outputs without modifying user prompts.
A three-step, role-based defense agency: intention analysis, prompt inference, and final judgment, implemented as 1–3 LLM agents coordinated by a controller.
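The three roles above can be sketched as a fixed-order pipeline. This is a minimal illustration, not the paper's AutoGen implementation: the `LLM` callable type, the prompt wording, the `VALID`/`INVALID` verdict format, and the refusal text are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

# Assumed refusal message returned when the judge rejects a response.
REFUSAL = "I'm sorry, but I can't assist with that."


def defend(response: str, llm: LLM) -> str:
    """Filter one candidate response through three roles in a fixed order."""
    # Step 1: intention analysis of the victim model's response.
    intention = llm(f"Analyze the intention behind this response:\n{response}")

    # Step 2: prompt inference, guessing which user prompts could have produced it.
    inferred = llm(
        "Infer prompts that could have led to this response.\n"
        f"Response:\n{response}\nIntention analysis:\n{intention}"
    )

    # Step 3: final judgment based on the response plus both analyses.
    verdict = llm(
        "Judge whether the response is safe to release. Answer VALID or INVALID.\n"
        f"Response:\n{response}\nIntention analysis:\n{intention}\n"
        f"Inferred prompts:\n{inferred}"
    )

    # Block the response only when the judge explicitly says INVALID.
    return REFUSAL if "INVALID" in verdict.upper() else response
```

Because the filter only sees the response, the user's prompt is never modified. In a one-agent configuration the three steps would presumably be folded into a single call; here each step is a separate call for clarity.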
Key Findings
Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.
Integrating Llama Guard as a fourth agent reduced False Positive Rate (FPR) for LLaMA-2-7B defense from 37.32% to 6.80%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Attack Success Rate (ASR) | 7.95% | 55.74% | -47.79pp | DAN dataset; GPT-3.5 victim; LLaMA-2-13B defender; 3-agent | AutoDefense three-agent reduces ASR from 55.74% to 7.95% | Table 1 |
| False Positive Rate (FPR) | 6.80% | 37.32% | -30.52pp | DAN dataset; LLaMA-2-7B defender; 4-agent with Llama Guard | Adding Llama Guard as a 4th agent lowers FPR to 6.80% | Table 3 |
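
The Delta column is a difference in percentage points (value minus baseline), which is not the same as a relative reduction. A short sketch that recomputes both from the table's numbers; the helper names are illustrative only.

```python
def pp_delta(value: float, baseline: float) -> float:
    """Absolute change in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_change(value: float, baseline: float) -> float:
    """Change as a fraction of the baseline (negative means a reduction)."""
    return (value - baseline) / baseline


# (value %, baseline %) pairs copied from the Results table.
rows = {"ASR": (7.95, 55.74), "FPR": (6.80, 37.32)}

for name, (value, baseline) in rows.items():
    print(
        f"{name}: {pp_delta(value, baseline):+.2f}pp, "
        f"{relative_change(value, baseline):+.1%} relative"
    )
```

Reporting both figures avoids ambiguity: a drop of about 48 percentage points in ASR corresponds to a relative reduction of well over 80%.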
What To Try In 7 Days
Prototype a response filter: run a small aligned model (e.g., LLaMA-2-13B) as a 1–3 agent filter on the outputs of an internal LLM, before they reach the user.
Evaluate on your common failure modes: collect past harmful outputs and measure ASR and FPR.
Add an off-the-shelf moderation agent (e.g., Llama Guard) as an extra agent to cut false positives without retraining.
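For the evaluation step, ASR and FPR can be computed from a log of filtered responses. The sketch below assumes the usual definitions (ASR: share of attack prompts whose final reply is still harmful; FPR: share of safe prompts whose reply was blocked). The `Record` fields are hypothetical and would need to match however you label your own data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    is_attack: bool       # True if the prompt was a jailbreak attempt
    harmful_output: bool  # True if the final (post-filter) reply was judged harmful
    blocked: bool         # True if the filter replaced the reply with a refusal


def asr(records: List[Record]) -> float:
    """Attack success rate: harmful final replies / attack prompts."""
    attacks = [r for r in records if r.is_attack]
    if not attacks:
        raise ValueError("no attack prompts in records")
    return sum(r.harmful_output for r in attacks) / len(attacks)


def fpr(records: List[Record]) -> float:
    """False positive rate: blocked replies / safe prompts."""
    safe = [r for r in records if not r.is_attack]
    if not safe:
        raise ValueError("no safe prompts in records")
    return sum(r.blocked for r in safe) / len(safe)
```

Running the same log through the filter with and without the extra moderation agent gives a direct before/after comparison on your own failure modes.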
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Infra Optimization
Model Optimization
Inference Optimization
Risks & Boundaries
Limitations
Agents communicate in a fixed order; dynamic communication is unexplored.
Effectiveness depends on alignment of defender LLMs (Vicuna family performs worse).
When Not To Use
When ultra-low latency is required and a 2–3x time increase is unacceptable (reported: 6.95s for three agents vs 2.81s for a single agent on H100).
If you cannot run or host an aligned open-source LLM for agents.
Failure Modes
Weakly aligned agent models (e.g., Vicuna) fail to detect harmful outputs.
Off-topic or subtle refusals may confuse keyword-based pre-filtering; GPT-4 judging is still needed for edge cases.

