Use small LLM agents to filter and block jailbreak responses from larger models

March 2, 20247 min

Overview

Decision SnapshotReady For Pilot

Results show large ASR drops on evaluated jailbreak sets and stable accuracy on safe prompts, but defense quality depends on the alignment level of the small defender model and adds modest latency per response.

Citations11

Evidence Strength0.80

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.

Who Should Care

Summary TLDR

AutoDefense is a multi-agent response-filter that watches model outputs and blocks harmful replies produced after jailbreak prompts. It splits the defense into simple roles (intention analyzer, prompt inferrer, judge) and runs them as LLM agents (built on AutoGen). With LLaMA-2-13B as the defender, a three-agent setup lowers GPT-3.5's attack success rate (ASR) on the evaluated DAN jailbreaks from 55.74% to 7.95%, while keeping accuracy on safe queries high (≈92.9%). The system is prompt-agnostic, can include other defenses (e.g., Llama Guard) as agents, and adds modest runtime overhead (single-agent 2.81s vs three-agent 6.95s on H100).

Problem Statement

Jailbreak prompts can trick aligned LLMs into producing harmful content. Existing defenses either retrain models (costly), change user prompts (fragile), or depend on a single model's instruction following. We need a robust, model-agnostic filter that inspects outputs and blocks harmful replies without changing user input.

Main Contribution

A general multi-agent response-filter framework (AutoDefense) that classifies and blocks harmful LLM outputs without modifying user prompts.

A three-step, role-based defense agency: intention analysis, prompt inference, and final judgment, implemented as 1–3 LLM agents coordinated by a controller.

Key Findings

Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.

NumbersASR 55.74%7.95% (DAN, GPT-3.5 victim)

Practical UseDeploy a small aligned model (LLaMA-2-13B) as a 3-agent filter to sharply reduce jailbreak success against GPT-3.5 on similar jailbreaks.

Evidence RefTable 1

Integrating Llama Guard as a fourth agent reduced False Positive Rate (FPR) for LLaMA-2-7B defense from 37.32% to 6.80%.

NumbersFPR 37.32%6.80% (LLaMA-2-7B, 4-agent)

Practical UseIf your defender model over-blocks safe replies, add a specialized moderation agent (e.g., Llama Guard) to lower false alarms.

Evidence RefTable 3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Attack Success Rate (ASR)7.95%55.74%-47.79ppDAN dataset; GPT-3.5 victim; LLaMA-2-13B defender; 3-agentAutoDefense three-agent reduces ASR from 55.74% to 7.95%Table 1
False Positive Rate (FPR)6.80%37.32%-30.52ppDAN dataset; LLaMA-2-7B defender; 4-agent with Llama GuardAdding Llama Guard as a 4th agent lowers FPR to 6.80%Table 3

What To Try In 7 Days

Prototype a response-filter: run a small aligned model (LLaMA-2-13B) as a 1–3 agent filter in front of an internal LLM.

Evaluate on your common failure modes: collect past harmful outputs and measure ASR and FPR.

Add an off-the-shelf moderation agent (e.g., Llama Guard) as an extra agent to cut false positives without retraining.

Agent Features

Memory
short-term conversational state only
Planning
coordinator-driven sequential interaction
Tool Use
can integrate external defense tools as agents (e.g., Llama Guard)
Frameworks
AutoGen
Is Agentic

Yes

Architectures
multi-agent (coordinator + role agents)
Collaboration
role-based task decomposition (intention analyzer, prompt inferrer, judge)

Optimization Features

Infra Optimization
benchmarked on NVIDIA H100; inference parallelism via llama-cpp-python
Model Optimization
INT8 quantization used for inference
Inference Optimization
temperature tuning; use of quantized open-source LLMs

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Agents communicate in a fixed order; dynamic communication is unexplored.

Effectiveness depends on alignment of defender LLMs (Vicuna family performs worse).

When Not To Use

When ultra-low latency is required and a 2–3x time increase is unacceptable.

If you cannot run or host an aligned open-source LLM for agents.

Failure Modes

Weakly aligned agent models (e.g., Vicuna) fail to detect harmful outputs.

Off-topic or subtle refusals may confuse keyword-based pre-filtering; GPT-4 judging still needed for edge cases.

Core Entities

Models

GPT-3.5-Turbo-1106 (victim)LLaMA-2-13B (defense agent)LLaMA-2-7BLLaMA-2-70BVicuna-13Bvicuna-7b-v1.5mistral-7b-v0.2mixtral-8x7b-v0.1

Metrics

Attack Success Rate (ASR)False Positive Rate (FPR)AccuracyDefense time (sec)

Datasets

DAN jailbreak set (390 questions)Curated harmful prompts (33 prompts)Stanford Alpaca instruction-following (52K subset used)