Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
ATA surfaces deep failures quickly and cheaply, and its test scenarios can be reused as regression tests, so teams can catch hard bugs before costly human review and shorten release cycles.
Summary TLDR
This paper introduces ATA, a meta-agent that automatically generates, runs, and judges adversarial conversational tests against other agents. ATA inspects agent code, asks the designer questions, mines papers/datasets, then synthesizes persona-based dialogues whose difficulty adapts using an LLM judge. On a travel planner and a Wikipedia writer, ATA found broader and deeper failures than a ten-person human team and finished in 20–30 minutes versus days. An ablation without code analysis and web search raises score variance and miscalibration, highlighting the value of evidence-grounded test generation.
Problem Statement
Agent evaluations rely on static benchmarks or small human studies that are slow, brittle, and low-coverage. Developers need an automated, repeatable way to find diverse, high-impact failures in agentic systems without heavy domain annotation.
Main Contribution
Design and open-source implementation of ATA, a meta-agent that auto-generates adversarial persona dialogues and evaluates agents end-to-end.
A weakness-planning algorithm that builds a difficulty posterior and adapts test difficulty online using judge feedback.
An empirical comparison showing that ATA surfaces failures that are deeper than, and complementary to, those found by human annotators while cutting evaluation time from days to minutes, plus an ablation showing that evidence gathering matters for calibration.
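The adaptive-difficulty idea in the second contribution can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a hypothetical 1-5 difficulty scale, a simple threshold model ("the agent starts failing at level t"), a judge verdict already reduced to pass/fail, and an assumed noise rate `slip`. The class name and all parameters are illustrative.

```python
from dataclasses import dataclass, field

LEVELS = (1, 2, 3, 4, 5)  # hypothetical difficulty scale, not taken from the paper


@dataclass
class DifficultyPosterior:
    """Discrete belief over the level t at which the agent starts failing."""

    probs: dict = field(
        default_factory=lambda: {t: 1 / len(LEVELS) for t in LEVELS}
    )
    slip: float = 0.2  # assumed chance that a verdict contradicts the threshold model

    def update(self, difficulty: int, passed: bool) -> None:
        """Bayesian update from one judged test at the given difficulty."""
        for t in LEVELS:
            # Under hypothesis t, the agent is expected to pass only below t.
            expected_pass = difficulty < t
            likelihood = (1 - self.slip) if passed == expected_pass else self.slip
            self.probs[t] *= likelihood
        total = sum(self.probs.values())
        for t in LEVELS:
            self.probs[t] /= total

    def next_difficulty(self) -> int:
        """Probe next at the most probable breaking point (posterior mode)."""
        return max(LEVELS, key=lambda t: self.probs[t])


posterior = DifficultyPosterior()
posterior.update(3, passed=True)    # agent handled a level-3 dialogue
posterior.update(4, passed=False)   # agent failed a level-4 dialogue
print(posterior.next_difficulty())  # most likely breaking point so far
```

In this sketch each weakness thread would keep its own posterior, so the test difficulty for one weakness adapts independently of the others.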
Key Findings
ATA finds a broader and deeper set of failures than a ten-annotator human round while matching severity on overlapping issues.
Removing code analysis and web search increases score variance and miscalibration.
Citation evaluation collapses without evidence grounding.
Results
End-to-end run time
Score variance (σ²) on human-overlapping weaknesses
Travel planner — Constraint handling (rubric average)
Wikipedia — Use of citations (rubric average)
Who Should Care
What To Try In 7 Days
Run ATA on a development agent and compare its top 10 failures to recent bug reports.
Add ATA's test scenarios to CI as smoke tests to catch regressions per weakness thread.
Calibrate ATA rubrics to match one expert annotator, then run the ablation to see evidence impact.
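One way the CI suggestion above might look is sketched here. It assumes ATA scenarios have been exported to JSON with hypothetical `id` and `prompt` fields, and that the agent and judge are plain callables. The stub agent, stub judge, scenario contents, and the `min_score` threshold are placeholders for illustration, not part of ATA's actual interface.

```python
import json
from typing import Callable


def run_smoke_tests(
    scenarios_json: str,
    agent: Callable[[str], str],
    judge: Callable[[str, str], float],
    min_score: float = 6.0,  # assumed pass mark on a 1-10 judge scale
) -> list:
    """Return the ids of scenarios whose judged score falls below min_score."""
    failures = []
    for scenario in json.loads(scenarios_json):
        reply = agent(scenario["prompt"])
        if judge(scenario["prompt"], reply) < min_score:
            failures.append(scenario["id"])
    return failures


# Hypothetical exported scenarios, one per weakness thread.
SCENARIOS = json.dumps([
    {"id": "budget-constraint", "prompt": "Plan a 3-day trip under $500."},
    {"id": "date-conflict", "prompt": "Book flights that overlap my meeting."},
])


def stub_agent(prompt: str) -> str:
    # Stand-in for the agent under test.
    return "I cannot help" if "overlap" in prompt else "Here is a plan"


def stub_judge(prompt: str, reply: str) -> float:
    # Stand-in for an LLM judge; returns a score on a 1-10 scale.
    return 3.0 if reply.startswith("I cannot") else 8.0


failed = run_smoke_tests(SCENARIOS, stub_agent, stub_judge)
print(failed)
```

A CI job could fail the build whenever the returned list is non-empty, which ties each regression to a named weakness thread.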
Agent Features
Memory
- global JSON-like shared state
- per-thread history and difficulty posterior
Planning
- weakness-planning (builds prioritized failure hypotheses)
- adaptive difficulty planning (difficulty posterior)
Tool Use
- static code analysis via LLM
- web/literature search
- persona-based prompt generation
- LLM-as-a-judge scoring (LAAJ)
Frameworks
- open-source Agent-Testing-Agent repo
- LAAJ judging pipeline
Is Agentic
true
Architectures
- meta-agent (agent that tests agents)
- threaded per-weakness execution
Collaboration
- designer interrogation (user-in-the-loop refinement)
- evidence gathering from literature and datasets
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Misses interpersonal and tone-related failures that humans detect well (§4.4).
- Evaluated on only two domains (travel planner, Wikipedia writer), so cross-domain generality is unproven.
- Relies on access to the agent codebase for best calibration; limited access reduces effectiveness.
- LAAJ judgments inherit LLM biases and rubric design choices.
When Not To Use
- When human tone, emotional nuance, or interpersonal behavior is the primary concern.
- When you cannot provide code access or context for evidence grounding.
- For safety-critical deployments without human-in-the-loop validation.
Failure Modes
- Judge miscalibration or bias that over- or under-scores real issues.
- High score variance when evidence gathering is disabled (ablated pipeline).
- Tests that overfit to the rubric and miss pragmatic user expectations.
- False confidence in coverage for domains not included in literature retrieval.
Core Entities
Models
- GPT-4.1-mini
- OpenAI o3 (deep-reasoning model)
Metrics
- LAAJ overall score (1–10)
- Rubric criterion scores (1–5 per criterion)
- Score variance σ² (comparison: 3.23 vs 7.15)
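For readers who want to reproduce a variance comparison of this kind on their own runs, a minimal sketch follows. The score lists are made-up illustrative values, not data from the paper, and the choice of population variance (rather than sample variance) is an assumption, since the summary does not say which was used.

```python
from statistics import pvariance


def score_variance(scores):
    """Population variance of repeated judge scores for one weakness."""
    return pvariance(scores)


# Made-up 1-10 judge scores from repeated runs on the same weakness.
stable_run = [6, 7, 6, 7, 6]
noisy_run = [2, 9, 4, 10, 5]

print(score_variance(stable_run))
print(score_variance(noisy_run))
```

A lower variance means repeated runs score the same weakness more consistently.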
Datasets
- TRAVELPLANNER (referenced)
Benchmarks
- TRAVELPLANNER

