A meta-agent that auto-generates persona-driven adversarial tests and judges agents to find deeper failures fast

Overview

Decision Snapshot: Needs Validation

The system is a practical developer tool: it speeds up failure discovery and supports regression testing, but it needs rubric tuning and human review for tone and stylistic judgment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Sameer Komoravolu, Khalil Mrini

Links

Abstract / PDF / Code

Why It Matters For Business

ATA finds deeper, regression-ready failures fast and cheaply, letting teams catch hard bugs before costly human reviews and speed up release cycles.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO

Summary TLDR

This paper introduces ATA, a meta-agent that automatically generates, runs, and judges adversarial conversational tests against other agents. ATA inspects the agent's code, asks the designer questions, and mines papers and datasets, then synthesizes persona-based dialogues whose difficulty adapts based on feedback from an LLM judge. On a travel planner and a Wikipedia writer, ATA found broader and deeper failures than a ten-person human team and finished in 20–30 minutes, versus ten days for the human round. An ablation that removes code analysis and web search raises score variance and miscalibration, highlighting the value of evidence-grounded test generation.

Problem Statement

Agent evaluations rely on static benchmarks or small human studies that are slow, brittle, and low-coverage. Developers need an automated, repeatable way to find diverse, high-impact failures in agentic systems without heavy domain annotation.

Main Contribution

Design and open-source implementation of ATA, a meta-agent that auto-generates adversarial persona dialogues and evaluates agents end-to-end.

A weakness-planning algorithm that builds a difficulty posterior and adapts test difficulty online using judge feedback.
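The adaptive-difficulty idea can be pictured with the sketch below. It is not ATA's actual algorithm: the five discrete difficulty levels, the Beta pseudo-counts, the pass/fail cutoff of 6 on the judge score, and the rule of probing the level whose estimated failure rate is closest to 50% are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DifficultyPosterior:
    """Hypothetical Beta(alpha, beta) belief over the failure rate at each level."""

    levels: tuple = (1, 2, 3, 4, 5)            # assumed discrete difficulty levels
    alpha: dict = field(default_factory=dict)  # pseudo-counts of observed failures
    beta: dict = field(default_factory=dict)   # pseudo-counts of observed passes

    def __post_init__(self):
        # Start every level from a uniform Beta(1, 1) prior.
        for lvl in self.levels:
            self.alpha.setdefault(lvl, 1.0)
            self.beta.setdefault(lvl, 1.0)

    def failure_mean(self, level):
        """Posterior mean probability that the agent fails at this level."""
        return self.alpha[level] / (self.alpha[level] + self.beta[level])

    def update(self, level, judge_score, fail_below=6):
        """Fold one judge score into the posterior (cutoff is an assumption)."""
        if judge_score < fail_below:
            self.alpha[level] += 1.0
        else:
            self.beta[level] += 1.0

    def next_level(self, target=0.5):
        """Pick the level whose failure estimate is closest to the target rate.

        Ties are broken in favor of the lower level.
        """
        return min(self.levels,
                   key=lambda lvl: (abs(self.failure_mean(lvl) - target), lvl))


posterior = DifficultyPosterior()
level = posterior.next_level()          # all levels tie at first, so level 1
posterior.update(level, judge_score=9)  # the agent passed at level 1
level = posterior.next_level()          # the tester moves on to level 2
```

Targeting a 50% failure rate keeps the tester near the boundary between what the agent handles and what it does not, which is one plausible reading of "adapts test difficulty online".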

Key Findings

ATA finds a broader and deeper set of failures than a ten-annotator human round while matching severity on overlapping issues.

Numbers: Full runs: ATA completed in 20–30 minutes vs. a human round taking ten days

Practical Use: Use ATA to get fast, depth-focused failure discovery for CI and regression tests before costly human review.

Evidence Ref: §4.4 (Cost/Time) and §4 (Holistic Comparison)

Removing code analysis and web search increases score variance and miscalibration.

Numbers: Score variance σ²: ablated 7.15 vs. full ATA 3.23

Practical Use: Include code- and literature-based evidence in automated testers to avoid noisy, unreliable scores.

Evidence Ref: §5.2 (Score Distributions)
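To make the variance comparison concrete, the sketch below computes σ² for two sets of judge scores. The score lists are made-up toy values, not data from the paper, and the use of population variance (rather than sample variance) is an assumption, since the summary does not say which one was reported.

```python
from statistics import pvariance


def score_variance(scores):
    """Population variance of a list of judge scores (an assumed definition)."""
    return pvariance(scores)


# Toy scores for illustration only; these are NOT results from the paper.
grounded_scores = [6, 7, 7, 8]   # a tester whose scores cluster tightly
ablated_scores = [2, 5, 8, 10]   # a tester whose scores are spread out

grounded_var = score_variance(grounded_scores)
ablated_var = score_variance(ablated_scores)
print(f"grounded={grounded_var:.2f} ablated={ablated_var:.2f}")
```

A higher variance on the same set of weaknesses means the tester's scores are less repeatable, which is the sense in which the ablated pipeline is described as noisier.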

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
End-to-end run time	ATA: 20–30 minutes; Human round: 10 days	—	—	Travel + Wikipedia experiments	§4.4 Cost/Time	§4.4
Score variance (σ²) on human-overlapping weaknesses	Full ATA: 3.23; Ablated ATA: 7.15	—	ablated higher by 3.92	Ablation study (§5.2)	§5.2 Score Distributions	§5.2

What To Try In 7 Days

Run ATA on a development agent and compare its top 10 failures to recent bug reports.

Add ATA's test scenarios to CI as smoke tests to catch regressions per weakness thread.

Calibrate ATA rubrics to match one expert annotator, then run the ablation to see evidence impact.
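The CI step above could look roughly like the following sketch. The scenario fields (`id`, `weakness`, `prompt`, `baseline_score`), the `agent` and `judge` callables, and the one-point tolerance are hypothetical; ATA's own export format and interfaces may differ.

```python
from typing import Callable, Dict, List


def find_regressions(
    scenarios: List[dict],
    agent: Callable[[str], str],
    judge: Callable[[str, str], int],
    tolerance: int = 1,
) -> Dict[str, List[str]]:
    """Return scenario ids, grouped by weakness thread, that score below baseline.

    A scenario counts as a regression when the new judge score drops more than
    `tolerance` points below the score stored with the scenario.
    """
    regressions: Dict[str, List[str]] = {}
    for scenario in scenarios:
        reply = agent(scenario["prompt"])
        score = judge(scenario["prompt"], reply)
        if score < scenario["baseline_score"] - tolerance:
            regressions.setdefault(scenario["weakness"], []).append(scenario["id"])
    return regressions


# Stub agent and judge so the sketch runs without any model calls.
def stub_agent(prompt: str) -> str:
    return "I cannot help with that." if "budget" in prompt else "Here is a plan."


def stub_judge(prompt: str, reply: str) -> int:
    return 3 if reply.startswith("I cannot") else 8


stored = [
    {"id": "s1", "weakness": "budget limits", "prompt": "Plan a trip on a $200 budget", "baseline_score": 7},
    {"id": "s2", "weakness": "date handling", "prompt": "Plan a trip for next weekend", "baseline_score": 8},
]
failed = find_regressions(stored, stub_agent, stub_judge)
```

In a CI job, a non-empty result would fail the build and point reviewers at the affected weakness threads.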

Agent Features

Memory

global JSON-like shared state

per-thread history and difficulty posterior
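A minimal sketch of what such a shared state might look like is given below. The class names, field names, and the five-level posterior stored as a list are assumptions for illustration; the summary only says that the state is JSON-like and holds per-thread history and a difficulty posterior.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class WeaknessThread:
    """State for one weakness thread (hypothetical layout)."""

    weakness: str
    history: List[dict] = field(default_factory=list)
    # Assumed: one probability per difficulty level, starting uniform.
    difficulty_posterior: List[float] = field(default_factory=lambda: [0.2] * 5)


@dataclass
class SharedState:
    """Global state visible to every stage of the tester (hypothetical layout)."""

    agent_summary: str = ""
    threads: Dict[str, WeaknessThread] = field(default_factory=dict)

    def record_turn(self, thread_id, weakness, prompt, reply, judge_score):
        """Append one judged exchange to the history of the given thread."""
        thread = self.threads.setdefault(thread_id, WeaknessThread(weakness))
        thread.history.append(
            {"prompt": prompt, "reply": reply, "judge_score": judge_score}
        )

    def to_json(self) -> str:
        """Serialize the whole state, nested threads included, to JSON."""
        return json.dumps(asdict(self), indent=2)


state = SharedState(agent_summary="travel planner under test")
state.record_turn("t1", "budget limits", "Plan a trip for $200", "Sorry, no.", 4)
snapshot = json.loads(state.to_json())
```

Keeping the state serializable makes it easy to checkpoint a run and to replay a thread later as a regression test.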

Planning

weakness-planning (builds prioritized failure hypotheses)

adaptive difficulty planning (difficulty posterior)

Tool Use

static code analysis via LLM

web/literature search

persona-based prompt generation

LLM judge (LAAJ)

Frameworks

open-source Agent-Testing-Agent repo

LAAJ judging pipeline

Is Agentic

Yes

Architectures

meta-agent (agent that tests agents)

threaded per-weakness execution
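Threaded per-weakness execution can be sketched with Python's standard thread pool, as below. The `run_weakness_thread` function is a placeholder that only fabricates probe labels; in a real tester it would drive a persona dialogue and call the judge. The function names and the fixed number of turns are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_weakness_thread(weakness: str, max_turns: int = 3) -> dict:
    """Placeholder for one weakness thread; returns labels instead of dialogues."""
    turns = [f"probe {i + 1} for {weakness}" for i in range(max_turns)]
    return {"weakness": weakness, "turns": turns}


def run_all(weaknesses: List[str], max_workers: int = 4) -> Dict[str, dict]:
    """Run every weakness thread concurrently and collect results by weakness."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input list.
        results = list(pool.map(run_weakness_thread, weaknesses))
    return {result["weakness"]: result for result in results}


report = run_all(["budget limits", "date handling", "unsupported cities"])
```

Threads suit this workload because each weakness thread would spend most of its time waiting on model or network calls rather than on CPU work.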

Collaboration

designer interrogation (user-in-the-loop refinement)

evidence gathering from literature and datasets

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/KhalilMrini/Agent-Testing-Agent

Risks & Boundaries

Limitations

Misses interpersonal and tone-related failures that humans detect well (§4.4).

Evaluated on only two domains (travel planner, Wikipedia writer), so cross-domain generality is unproven.

When Not To Use

When human tone, emotional nuance, or interpersonal behavior is the primary concern.

When you cannot provide code access or context for evidence grounding.

Failure Modes

Judge miscalibration or bias producing over/under-scoring of real issues.

High score variance when evidence gathering is disabled (ablated pipeline).
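One simple way to check for judge miscalibration is to compare judge scores with scores from a human annotator on the same items, as in the sketch below. The bias and mean-absolute-error measures, the function name, and the toy score lists are assumptions chosen for illustration; they are not the paper's calibration procedure or data.

```python
from statistics import mean
from typing import Dict, Sequence


def calibration_report(
    judge_scores: Sequence[float], human_scores: Sequence[float]
) -> Dict[str, float]:
    """Compare paired judge and human scores for the same test items.

    bias > 0 means the judge scores higher than the human on average;
    mae is the mean absolute gap between the two.
    """
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {"bias": mean(diffs), "mae": mean([abs(d) for d in diffs])}


# Toy paired scores for illustration only; NOT data from the paper.
report_vs_human = calibration_report([7, 5, 9], [6, 6, 7])
```

A large bias suggests the rubric should be tightened or loosened as a whole, while a large error with near-zero bias points to inconsistent scoring on individual items.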

Core Entities

Models

GPT-4.1-mini

o3 deep-reasoning (OpenAI o3)

Metrics

LAAJ overall score (1–10)

Rubric criterion scores (1–5 per criterion)

Score variance σ² (comparison: 3.23 vs. 7.15)

Datasets

TRAVELPLANNER (referenced)

Benchmarks

TRAVELPLANNER
