A meta-agent that auto-generates persona-driven adversarial tests and judges agents to find deeper failures fast

August 24, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Sameer Komoravolu, Khalil Mrini

Why It Matters For Business

ATA finds deeper, regression-ready failures quickly and cheaply, letting teams catch hard bugs before costly human review and shorten release cycles.

Summary TLDR

This paper introduces ATA, a meta-agent that automatically generates, runs, and judges adversarial conversational tests against other agents. ATA inspects agent code, asks the designer questions, mines papers/datasets, then synthesizes persona-based dialogues whose difficulty adapts using an LLM judge. On a travel planner and a Wikipedia writer, ATA found broader and deeper failures than a ten-person human team and finished in 20–30 minutes versus days. An ablation without code analysis and web search raises score variance and miscalibration, highlighting the value of evidence-grounded test generation.

Problem Statement

Agent evaluations rely on static benchmarks or small human studies that are slow, brittle, and low-coverage. Developers need an automated, repeatable way to find diverse, high-impact failures in agentic systems without heavy domain annotation.

Main Contribution

Design and open-source implementation of ATA, a meta-agent that auto-generates adversarial persona dialogues and evaluates agents end-to-end.

A weakness-planning algorithm that builds a difficulty posterior and adapts test difficulty online using judge feedback.

Empirical comparison showing ATA surfaces complementary and deeper failures than human annotators while cutting evaluation time from days to minutes; plus an ablation showing evidence-gathering matters for calibration.
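
The difficulty posterior named in the second contribution can be illustrated with a small Bayesian update. This is a minimal sketch under stated assumptions, not the paper's algorithm: the five-level difficulty scale, the noise rate, and the `DifficultyPosterior` class are all hypothetical. It keeps a categorical belief over the lowest difficulty at which the agent starts to fail and probes at the most probable level next.

```python
from dataclasses import dataclass, field

LEVELS = [1, 2, 3, 4, 5]  # hypothetical difficulty scale (assumption)
NOISE = 0.2               # assumed chance a judged outcome contradicts a hypothesis


@dataclass
class DifficultyPosterior:
    """Categorical belief over the lowest difficulty at which the agent fails."""
    probs: dict = field(default_factory=lambda: {b: 1 / len(LEVELS) for b in LEVELS})

    def update(self, difficulty: int, failed: bool) -> None:
        """Bayes update after one judged dialogue run at `difficulty`."""
        for b in LEVELS:
            # Under hypothesis "breaking point is b", tests at or above b should fail.
            expected_fail = difficulty >= b
            likelihood = 1 - NOISE if failed == expected_fail else NOISE
            self.probs[b] *= likelihood
        total = sum(self.probs.values())
        for b in LEVELS:
            self.probs[b] /= total

    def next_difficulty(self) -> int:
        """Probe at the currently most probable breaking point."""
        return max(LEVELS, key=lambda b: self.probs[b])


posterior = DifficultyPosterior()
posterior.update(3, failed=False)  # judge says the agent passed at level 3
posterior.update(4, failed=True)   # judge says the agent failed at level 4
print(posterior.next_difficulty())
```

In a real loop, `failed` would come from thresholding the judge's score, and the threshold itself would be a design choice to calibrate.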

Key Findings

ATA finds a broader and deeper set of failures than a ten-annotator human round while matching severity on overlapping issues.

Numbers: Full runs: ATA completed in 20–30 minutes vs. a human round taking ten days

Removing code analysis and web search increases score variance and miscalibration.

Numbers: Score variance σ²: ablated 7.15 vs. full ATA 3.23

Citation evaluation collapses without evidence grounding.

Numbers: Wikipedia citations: ATA full 6.0/10 vs. ablated 1.7/10 (humans 3.53/10)

Results

End-to-end run time

Value: ATA: 20–30 minutes; Human round: 10 days

Score variance (σ²) on human-overlapping weaknesses

Value: Full ATA: 3.23; Ablated ATA: 7.15

Travel planner — Constraint handling (rubric average)

Value: Annotators 4.07 / ATA 3.53

Baseline: Annotator avg 4.07

Wikipedia — Use of citations (rubric average)

Value: Annotators 3.53 / ATA 3.60 / Ablated ATA 1.7

Baseline: Annotator avg 3.53

Who Should Care

What To Try In 7 Days

Run ATA on a development agent and compare its top 10 failures to recent bug reports.

Add ATA's test scenarios to CI as smoke tests to catch regressions per weakness thread.

Calibrate ATA rubrics to match one expert annotator, then run the ablation to see evidence impact.
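
The CI smoke-test idea above could be sketched as follows. Everything here is an assumption for illustration: the scenario record fields (`id`, `weakness`, `prompt`), the `find_regressions` helper, and the pass threshold are hypothetical, and `run_agent` and `judge` stand in for whatever callables wrap your agent and your LLM judge.

```python
def find_regressions(scenarios, run_agent, judge, min_score=3.0):
    """Replay stored scenarios and group failing scenario ids by weakness thread.

    scenarios: list of dicts with hypothetical keys "id", "weakness", "prompt".
    run_agent: callable(prompt) -> reply string (wraps the agent under test).
    judge:     callable(prompt, reply) -> numeric score (wraps the LLM judge).
    """
    regressions = {}
    for scenario in scenarios:
        reply = run_agent(scenario["prompt"])
        score = judge(scenario["prompt"], reply)
        if score < min_score:
            regressions.setdefault(scenario["weakness"], []).append(scenario["id"])
    return regressions


# Stub agent and judge so the sketch runs without any external service.
demo_scenarios = [
    {"id": "t1", "weakness": "constraint handling", "prompt": "Keep the trip under budget."},
    {"id": "t2", "weakness": "citations", "prompt": "Cite a source for each claim."},
]
stub_agent = lambda prompt: "stub reply to: " + prompt
stub_judge = lambda prompt, reply: 2.0 if "budget" in prompt else 4.0

print(find_regressions(demo_scenarios, stub_agent, stub_judge))
```

A CI job could fail the build whenever the returned dictionary is non-empty, which keeps the report organized per weakness thread.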

Agent Features

Memory

  • global JSON-like shared state
  • per-thread history and difficulty posterior
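
One way to picture the two memory layers listed above is a pair of dataclasses that serialize to JSON. This is a sketch under assumptions: the class and field names (`SharedState`, `ThreadState`, `agent_summary`, and so on) are hypothetical and are not taken from the ATA repository.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ThreadState:
    """State private to one weakness thread."""
    weakness: str
    history: list = field(default_factory=list)               # judged dialogues so far
    difficulty_posterior: dict = field(default_factory=dict)  # level label -> probability


@dataclass
class SharedState:
    """Global, JSON-serializable state visible to every thread."""
    agent_summary: str = ""
    threads: dict = field(default_factory=dict)  # weakness name -> ThreadState

    def thread(self, weakness: str) -> ThreadState:
        """Fetch the state for a weakness, creating it on first use."""
        return self.threads.setdefault(weakness, ThreadState(weakness))

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


state = SharedState(agent_summary="travel planner under test")
thread = state.thread("constraint handling")
thread.history.append({"difficulty": 3, "score": 2})
thread.difficulty_posterior = {"3": 0.6, "4": 0.4}
print(state.to_json())
```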

Planning

  • weakness-planning (builds prioritized failure hypotheses)
  • adaptive difficulty planning (difficulty posterior)

Tool Use

  • static code analysis via LLM
  • web/literature search
  • persona-based prompt generation
  • LLM judge (LAAJ)

Frameworks

  • open-source Agent-Testing-Agent repo
  • LAAJ judging pipeline

Is Agentic

true

Architectures

  • meta-agent (agent that tests agents)
  • threaded per-weakness execution
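
The per-weakness execution listed above could look like the sketch below. It assumes that each weakness thread is independent and can run concurrently; whether ATA uses OS threads or simply separate conversation threads is not stated here. `run_weakness_thread` is a hypothetical placeholder for the generate, run, and judge loop.

```python
from concurrent.futures import ThreadPoolExecutor


def run_weakness_thread(weakness, max_tests=3):
    """Placeholder for one thread: generate, run, and judge tests for a weakness."""
    # A real implementation would call the agent under test and the judge here.
    return {"weakness": weakness, "tests_run": max_tests}


def run_all(weaknesses, max_workers=4):
    """Run every weakness thread concurrently and index results by weakness."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_weakness_thread, weaknesses))
    return {result["weakness"]: result for result in results}


print(run_all(["constraint handling", "use of citations"]))
```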

Collaboration

  • designer interrogation (user-in-the-loop refinement)
  • evidence gathering from literature and datasets

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Misses interpersonal and tone-related failures that humans detect well (§4.4).
  • Evaluated on only two domains (travel planner, Wikipedia writer), so cross-domain generality is unproven.
  • Relies on access to the agent codebase for best calibration; limited access reduces effectiveness.
  • LAAJ judgments inherit LLM biases and rubric design choices.

When Not To Use

  • When human tone, emotional nuance, or interpersonal behavior is the primary concern.
  • When you cannot provide code access or context for evidence grounding.
  • For safety-critical deployments without human-in-the-loop validation.

Failure Modes

  • Judge miscalibration or bias producing over/under-scoring of real issues.
  • High score variance when evidence gathering is disabled (ablated pipeline).
  • Tests that overfit to the rubric and miss pragmatic user expectations.
  • False confidence in coverage for domains not included in literature retrieval.
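
Judge miscalibration, the first failure mode above, can be checked by comparing judge scores with human scores on the same items. The sketch below is one simple way to do that and is an assumption rather than the paper's procedure: `calibration_report` is a hypothetical helper, and signed bias plus mean absolute error are swapped in as generic calibration summaries.

```python
from statistics import mean


def calibration_report(judge_scores, human_scores):
    """Compare paired judge and human scores on the same test items.

    Returns the signed bias (positive means the judge over-scores)
    and the mean absolute error between the two raters.
    """
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {"bias": mean(diffs), "mae": mean(abs(d) for d in diffs)}


# Made-up scores, for illustration only.
print(calibration_report([4, 2, 5], [3, 3, 3]))
```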

Core Entities

Models

  • GPT-4.1-mini
  • o3 deep-reasoning (OpenAI o3)

Metrics

  • LAAJ overall score (1–10)
  • Rubric criterion scores (1–5 per criterion)
  • Score variance σ² (comparison: 3.23 vs 7.15)
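
The score variance metric above can be computed from a list of judge scores. This sketch assumes population variance; the summary does not say whether the paper uses the population or the sample form. The scores below are made up for illustration and are not results from the paper.

```python
from statistics import pvariance


def score_variance(scores):
    """Population variance (sigma squared) of a list of judge scores."""
    return pvariance(scores)


# Made-up scores, for illustration only.
illustrative_scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(score_variance(illustrative_scores))
```

Swapping in `statistics.variance` would give the sample form instead, which is slightly larger for small score sets.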

Datasets

  • TRAVELPLANNER (referenced)

Benchmarks

  • TRAVELPLANNER