Use small LLM agents to filter and block jailbreak responses from larger models

March 2, 20247 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

11

Authors

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu

Links

Abstract / PDF

Why It Matters For Business

AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.

Summary TLDR

AutoDefense is a multi-agent response-filter that watches model outputs and blocks harmful replies produced after jailbreak prompts. It splits the defense into simple roles (intention analyzer, prompt inferrer, judge) and runs them as LLM agents (built on AutoGen). With LLaMA-2-13B as the defender, a three-agent setup lowers GPT-3.5's attack success rate (ASR) on the evaluated DAN jailbreaks from 55.74% to 7.95%, while keeping accuracy on safe queries high (≈92.9%). The system is prompt-agnostic, can include other defenses (e.g., Llama Guard) as agents, and adds modest runtime overhead (single-agent 2.81s vs three-agent 6.95s on H100).

Problem Statement

Jailbreak prompts can trick aligned LLMs into producing harmful content. Existing defenses either retrain models (costly), change user prompts (fragile), or depend on a single model's instruction following. We need a robust, model-agnostic filter that inspects outputs and blocks harmful replies without changing user input.

Main Contribution

A general multi-agent response-filter framework (AutoDefense) that classifies and blocks harmful LLM outputs without modifying user prompts.

A three-step, role-based defense agency: intention analysis, prompt inference, and final judgment, implemented as 1–3 LLM agents coordinated by a controller.

Empirical evaluation showing large drops in attack success rate on standard jailbreak collections and demonstrations of composability by adding Llama Guard as a fourth agent to cut false positives.

Key Findings

Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.

NumbersASR 55.74% → 7.95% (DAN, GPT-3.5 victim)

Integrating Llama Guard as a fourth agent reduced False Positive Rate (FPR) for LLaMA-2-7B defense from 37.32% to 6.80%.

NumbersFPR 37.32% → 6.80% (LLaMA-2-7B, 4-agent)

Three-agent LLaMA-2-13B defense achieves high overall filtering accuracy of 92.91% across harmful and safe sets.

NumbersAccuracy 92.91% (3-agent, LLaMA-2-13B)

Multi-agent runs add modest latency: single-agent 2.81s vs three-agent 6.95s per response on H100 with INT8 quantized LLaMA-2-13B.

NumbersTime 2.81s → 6.95s (single → three-agent)

Results

Attack Success Rate (ASR)

Value7.95%

Baseline55.74%

False Positive Rate (FPR)

Value6.80%

Baseline37.32%

Accuracy

Value92.91%

Baseline90.71%

Defense time (average)

Value6.95s

Baseline2.81s

Who Should Care

What To Try In 7 Days

Prototype a response-filter: run a small aligned model (LLaMA-2-13B) as a 1–3 agent filter in front of an internal LLM.

Evaluate on your common failure modes: collect past harmful outputs and measure ASR and FPR.

Add an off-the-shelf moderation agent (e.g., Llama Guard) as an extra agent to cut false positives without retraining.

Agent Features

Memory

  • short-term conversational state only

Planning

  • coordinator-driven sequential interaction

Tool Use

  • can integrate external defense tools as agents (e.g., Llama Guard)

Frameworks

  • AutoGen

Is Agentic

true

Architectures

  • multi-agent (coordinator + role agents)

Collaboration

  • role-based task decomposition (intention analyzer, prompt inferrer, judge)

Optimization Features

Infra Optimization

  • benchmarked on NVIDIA H100; inference parallelism via llama-cpp-python

Model Optimization

  • INT8 quantization used for inference

Inference Optimization

  • temperature tuning; use of quantized open-source LLMs

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Agents communicate in a fixed order; dynamic communication is unexplored.
  • Effectiveness depends on alignment of defender LLMs (Vicuna family performs worse).
  • Some content classes (certain illegal/sex/gambling topics) still occasionally bypass defense.

When Not To Use

  • When ultra-low latency is required and a 2–3x time increase is unacceptable.
  • If you cannot run or host an aligned open-source LLM for agents.
  • If your threat model requires defending against prompt-only leakage that must be blocked before generation (different defense design).

Failure Modes

  • Weakly aligned agent models (e.g., Vicuna) fail to detect harmful outputs.
  • Off-topic or subtle refusals may confuse keyword-based pre-filtering; GPT-4 judging still needed for edge cases.
  • Some jailbreak styles that exploit content nuance can still slip through depending on agent understanding.

Core Entities

Models

  • GPT-3.5-Turbo-1106 (victim)
  • LLaMA-2-13B (defense agent)
  • LLaMA-2-7B
  • LLaMA-2-70B
  • Vicuna-13B
  • vicuna-7b-v1.5
  • mistral-7b-v0.2
  • mixtral-8x7b-v0.1

Metrics

  • Attack Success Rate (ASR)
  • False Positive Rate (FPR)
  • Accuracy
  • Defense time (sec)

Datasets

  • DAN jailbreak set (390 questions)
  • Curated harmful prompts (33 prompts)
  • Stanford Alpaca instruction-following (52K subset used)