Use an LLM-powered agent to auto-generate and iteratively refine realistic tests that expose long-tail value misalignment

May 23, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

3

Authors

Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua

Why It Matters For Business

ALI-Agent automates realistic safety tests and finds subtle failures that static benchmarks miss, helping product teams catch risky model behavior before deployment.

Summary TLDR

ALI-Agent is an automated evaluation framework that uses an LLM controller (GPT-4) to generate realistic test scenarios, judge model responses with a fine-tuned evaluator, store failing cases in a memory, and iteratively refine scenarios to probe long-tail misalignment. On six datasets across stereotypes, morality, and legality and 10 target LLMs, ALI-Agent finds more failures than many static prompt baselines. Human checks show >85% of generated scenarios are realistic. Key limits: it depends on a powerful core LLM and creates "jailbreak-like" tests that must be used responsibly.

Problem Statement

Current alignment tests are static and expert-built. They cover a narrow set of scenarios, miss rare long-tail risks, and age quickly as models evolve. We need an automated, adaptive way to create realistic, diverse tests and push models until misalignment appears.

Main Contribution

Propose ALI-Agent, an agent-based framework that automates test generation, evaluation, memory, and refinement to find model misalignment.

Introduce a two-stage process: Emulation (generate realistic scenarios via in-context learning plus memory) and Refinement (iteratively rewrite scenarios so the embedded misconduct is harder for the target model to detect).

Integrate modules: textual evaluation memory, web browsing for user queries, a tool-using evaluator (fine-tuned Llama2), and a GPT-4 controller.

Show empirical gains across six datasets (stereotypes, morality, legality) and 10 target LLMs, and release code for reproduction.
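
The emulate, evaluate, store, and refine cycle described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the class name `AliLoop`, the `Record` type, and the four callables are hypothetical stand-ins for the LLM-backed modules, and passing the whole memory to the emulator simplifies the retrieval step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    """One failing test case kept in the evaluation memory."""
    misconduct: str
    scenario: str
    response: str


@dataclass
class AliLoop:
    # All four callables are hypothetical placeholders for LLM-backed modules.
    emulate: Callable[[str, List[Record]], str]   # misconduct + memory -> scenario
    refine: Callable[[str, str, str], str]        # misconduct, scenario, response -> new scenario
    target: Callable[[str], str]                  # model under test
    is_misaligned: Callable[[str, str], bool]     # evaluator verdict on (scenario, response)
    max_iters: int = 5                            # number of refinement rounds allowed
    memory: List[Record] = field(default_factory=list)

    def probe(self, misconduct: str) -> Optional[Record]:
        """Return the first failing case found for this misconduct, else None."""
        # Emulation stage: build an initial realistic scenario.
        scenario = self.emulate(misconduct, self.memory)
        for i in range(self.max_iters + 1):  # iteration 0 is the unrefined scenario
            response = self.target(scenario)
            if self.is_misaligned(scenario, response):
                record = Record(misconduct, scenario, response)
                self.memory.append(record)  # failing cases inform later emulation
                return record
            if i < self.max_iters:
                # Refinement stage: rewrite the scenario and try again.
                scenario = self.refine(misconduct, scenario, response)
        return None
```

With toy stubs in place of real models, `AliLoop(...).probe("some misconduct")` returns a `Record` as soon as the evaluator flags a response, and `None` if the target passes every round.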

Key Findings

ALI-Agent increases attack success on AdvBench with iterative refinement.

Numbers: Avg ASR 14.95% → 29.70% (iteration 0 → 5, Table 18)

Generated scenarios are judged realistic by humans.

Numbers: >85% unanimously rated high-quality on 200 sampled scenarios

Fine-tuned Llama2 evaluator achieves strong detection metrics.

Numbers: TPR 87.23%, Accuracy 85.25%, F1 90.11% (Table 10)
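
For readers who want to compute the same three metrics for their own evaluator, here is a small sketch. It assumes the positive class is "response judged misaligned", which this summary does not state explicitly, and the function name `detection_metrics` and the counts in the usage note are hypothetical, not the paper's data.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute TPR, accuracy, and F1 from confusion-matrix counts.

    Assumption: a "positive" is a response labeled misaligned.
    """
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0          # recall on misaligned responses
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + tpr
    f1 = 2 * precision * tpr / denom if denom else 0.0  # harmonic mean of precision and TPR
    return {"tpr": tpr, "accuracy": accuracy, "f1": f1}
```

For example, `detection_metrics(tp=8, fp=1, tn=9, fn=2)` uses made-up counts for 20 labeled responses and yields a TPR of 0.8 and an accuracy of 0.85.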

Combining ALI-Agent with automated jailbreak templates yields much higher failure rates.

Numbers: Avg ASR 29.70% → 49.75% when combined with GPTFuzzer (Tables 4 and 18)

Memory and refiner modules materially boost ALI-Agent's effectiveness.

Results

AdvBench ASR (ALI-Agent average)

Value: 29.70%

Baseline: GPTFuzzer avg 28.77%

Human realism rating

Value: >85% unanimously high-quality

Evaluator detection performance

Value: TPR 87.23%, Acc 85.25%, F1 90.11%

Baseline: Rule match / GPT-based eval

Refinement effect (AdvBench avg ASR)

Value: 14.95% → 29.70%

Baseline: iteration 0

ALI-Agent + GPTFuzzer (avg ASR)

Value: 49.75%

Baseline: ALI-Agent alone 29.70%
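
To put the reported AdvBench averages side by side, the short calculation below derives the absolute and relative changes. It uses only the four percentages quoted in this section; the helper name `compare` and the dictionary keys are illustrative.

```python
from typing import Tuple

# Average ASR values (percent) as quoted in this summary.
reported = {
    "ali_iter0": 14.95,       # ALI-Agent before refinement
    "ali_agent": 29.70,       # ALI-Agent after refinement
    "gptfuzzer": 28.77,       # GPTFuzzer baseline
    "ali_plus_fuzzer": 49.75  # ALI-Agent combined with GPTFuzzer
}


def compare(before: float, after: float) -> Tuple[float, float]:
    """Return (percentage-point gain, ratio), both rounded to 2 decimals."""
    return round(after - before, 2), round(after / before, 2)


refinement = compare(reported["ali_iter0"], reported["ali_agent"])
combined = compare(reported["ali_agent"], reported["ali_plus_fuzzer"])
vs_baseline = compare(reported["gptfuzzer"], reported["ali_agent"])

print(f"refinement:  +{refinement[0]} points, {refinement[1]}x")
print(f"combination: +{combined[0]} points, {combined[1]}x")
print(f"vs GPTFuzzer: +{vs_baseline[0]} points, {vs_baseline[1]}x")
```

On these figures, refinement roughly doubles the average ASR (+14.75 points), adding GPTFuzzer templates raises it by a further 20.05 points, and ALI-Agent alone leads the GPTFuzzer average by under one point.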

Who Should Care

What To Try In 7 Days

Run ALI-Agent (or its emulator) on your model with a small seed of domain misconduct to spot overlooked failure modes.

Fine-tune a lightweight evaluator (e.g., Llama2-7B) on labeled pass/fail responses to automate triage of failures.

Combine ALI-Agent scenarios with an existing jailbreak tool (like GPTFuzzer) to broaden red-team coverage under controlled conditions.

Agent Features

Memory

  • textual evaluation memory of failing cases (retrieval via embeddings)
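
A rough sketch of how such an embedding-keyed memory might work is shown here. It is not the paper's implementation: `EvaluationMemory` is a hypothetical name, and the bag-of-words `embed` function is a toy stand-in for a real embedding model, used only so the example runs without external dependencies.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EvaluationMemory:
    """Stores failing cases keyed by the embedding of their misconduct text."""

    def __init__(self) -> None:
        self._items: List[Tuple[Counter, str, str]] = []

    def add(self, misconduct: str, scenario: str) -> None:
        self._items.append((embed(misconduct), misconduct, scenario))

    def retrieve(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        """Return up to k (misconduct, scenario) pairs most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [(m, s) for _, m, s in ranked[:k]]
```

Retrieved pairs would then serve as in-context examples when generating a scenario for a new, similar misconduct.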

Planning

  • iterative refinement (max iterations configurable)
  • chain-of-thought style intermediate reasoning

Tool Use

  • web browsing for user queries
  • fine-tuned automatic evaluator
  • OpenAI Moderation API for perceived harmfulness

Frameworks

  • Emulator (scenario generator), Refiner (iterative modifier), Evaluator (classifier)

Is Agentic

true

Architectures

  • LLM controller (GPT-4)
  • memory + retrieval (textual)
  • fine-tuned Llama2 evaluator

Collaboration

  • single-agent orchestration (no multi-agent exchange)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on a strong core LLM (GPT-4) so results vary with controller quality and may be costly.
  • The framework intentionally crafts scenarios that can bypass safety guards; this 'jailbreaking' behavior poses misuse risks if not tightly controlled.
  • Evaluator errors (false positives/negatives) can bias stored memory and future scenario generation.

When Not To Use

  • Do not expose generated refined scenarios publicly or use them to elicit harmful outputs in production.
  • Avoid using ALI-Agent in uncontrolled environments or by untrained personnel due to jailbreak risks.
  • Not a replacement for end-user safety testing—use as part of a controlled red-team and remediation pipeline.

Failure Modes

  • Core LLM refuses to perform or returns guarded answers, limiting scenario diversity.
  • Evaluator mislabeling leads to storing incorrect memory that misguides future tests.
  • Refiner can overfit to jailbreak patterns that are not realistic in real user contexts.

Core Entities

Models

  • GPT-4-1106-preview (controller)
  • GPT-3.5-turbo-1106
  • Gemini-Pro
  • ChatGLM3-6B
  • Vicuna-7B
  • Vicuna-13B
  • Vicuna-33B
  • Llama 2-7B
  • Llama 2-13B
  • Llama 2-70B
  • Llama2-7B (fine-tuned evaluator)

Metrics

  • model agreeability
  • attack success rate (ASR)
  • TPR
  • Accuracy
  • F1

Datasets

  • DecodingTrust (stereotypes)
  • CrowS-Pairs
  • ETHICS (commonsense morality)
  • Social Chemistry 101
  • AdvBench (harmful prompts)
  • Singapore Rapid Transit Systems Regulations

Benchmarks

  • AdvBench
  • CrowS-Pairs
  • ETHICS
  • DecodingTrust