Overview
The prototype was validated on a single finance workflow over 20–30 repeated runs. Results show improved stability and closer agreement with human judges, at the cost of more LLM calls; cross-domain testing is limited.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
AEMA gives more stable, explainable, and human-aligned evaluations for multi-step agent workflows. That yields predictable decision thresholds, fewer manual overrides, and auditable records for compliance.
Summary TLDR
AEMA is a four-role, multi-agent evaluation system that plans the evaluation, refines prompts, runs multiple evaluators, and produces auditable reports. On finance invoice workflows, AEMA gives more stable scores and a lower average error relative to human judges than a single LLM-as-a-judge, at the cost of extra LLM calls and latency.
Problem Statement
Current LLM evaluation methods score single responses and miss the multi-step process reliability, transparency, and auditability that enterprise multi-agent systems need.
Main Contribution
AEMA: a four-role, verifiable evaluation loop (Planning, Prompt-Refinement, Evaluation, Final Report) that records traceable decisions for audit and oversight.
A hybrid retrieval + synthesis approach to prepare few-shot exemplars and function parameters automatically for step-level evaluators.
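The four-role loop can be pictured as a pipeline in which every role writes its output to a shared trace. The sketch below is illustrative only: `Trace`, `run_evaluation`, and the four stub callables are hypothetical names, not the authors' implementation, and the stubs stand in for LLM-backed agents.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Append-only record of each role's output, kept for later audit."""
    events: list = field(default_factory=list)

    def log(self, role, output):
        self.events.append({"role": role, "output": output})


def run_evaluation(workflow_steps, plan_fn, refine_fn, evaluate_fn, report_fn):
    """Run the four roles in order and log every intermediate result."""
    trace = Trace()
    plan = plan_fn(workflow_steps)        # Planning: decide what to evaluate
    trace.log("planning", plan)
    prompts = refine_fn(plan)             # Prompt-Refinement: build evaluator prompts
    trace.log("prompt_refinement", prompts)
    scores = evaluate_fn(prompts)         # Evaluation: score each step
    trace.log("evaluation", scores)
    report = report_fn(scores)            # Final Report: aggregate the scores
    trace.log("final_report", report)
    return report, trace


# Stub roles (assumptions for illustration; real roles would call an LLM).
steps = ["extract_fields", "validate_totals"]
report, trace = run_evaluation(
    steps,
    plan_fn=lambda s: {step: "accuracy" for step in s},
    refine_fn=lambda plan: {step: f"Score {step} for {crit}" for step, crit in plan.items()},
    evaluate_fn=lambda prompts: {step: 1.0 for step in prompts},
    report_fn=lambda scores: {"mean_score": sum(scores.values()) / len(scores)},
)
```

Keeping the trace separate from the report means the intermediate plan and prompts remain inspectable even when only the final report is shown to a reviewer.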
Key Findings
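The planning debate reached consensus within three rounds in all 30 repeated runs (13 after one round, 13 after two, 4 after three).
AEMA is closer to human judges on average across the six steps for good-quality invoices: mean absolute error of 0.018 versus 0.077 for a single LLM judge.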
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Planning consensus rounds | 13/30 after 1 round; 13/30 after 2; 4/30 after 3 | — | — | Planning stability (30 runs) | Debate loop capped at five rounds; consensus typically by 1–3 rounds | Sec 4.3, Appendix B |
| Average absolute error to human (6 steps) | AEMA=0.018 | Single LLM=0.077 | AEMA lower by 0.059 | Good-quality invoices (20 runs) | Mean absolute error across step scores | Table 1 |
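For readers who want to reproduce the error metric on their own data, a minimal sketch of mean absolute error across step scores follows. The score lists are made-up placeholders, not values from the paper; only the 0.077 and 0.018 figures come from the table above.

```python
def mean_absolute_error(system_scores, human_scores):
    """Average of |system - human| over paired step scores."""
    if len(system_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not human_scores:
        raise ValueError("score lists must not be empty")
    total = sum(abs(s - h) for s, h in zip(system_scores, human_scores))
    return total / len(human_scores)


# Placeholder scores for six steps (illustrative only, not from the paper).
human = [1.0, 1.0, 0.5, 1.0, 0.5, 1.0]
system = [1.0, 0.5, 0.5, 1.0, 0.5, 1.0]
mae = mean_absolute_error(system, human)

# The delta reported in the table: single-LLM error minus AEMA error.
reported_delta = round(0.077 - 0.018, 3)
```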
What To Try In 7 Days
Run AEMA on one existing multi-step workflow (e.g., invoice validation) to compare stability vs your current judge.
Enable the Planning debate and cap at three rounds to balance stability and cost.
Turn on trace logging (plan → prompts → evaluations → report) to create an audit trail for one critical workflow.
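A capped debate can be sketched as a loop that stops as soon as all planners propose the same plan, or falls back to the most common proposal when the cap is hit. This is an assumption-laden illustration: `debate_until_consensus`, the planner signature, and the majority fallback are hypothetical and not taken from the paper.

```python
from collections import Counter


def debate_until_consensus(planners, task, max_rounds=3):
    """Return (plan, rounds_used, reached_consensus).

    Each planner is a callable (task, previous_proposals) -> hashable plan.
    """
    proposals = [None] * len(planners)
    for round_no in range(1, max_rounds + 1):
        # Every planner sees the proposals from the previous round.
        proposals = [planner(task, proposals) for planner in planners]
        if len(set(proposals)) == 1:
            return proposals[0], round_no, True
    # Cap reached without agreement: fall back to the most common proposal.
    plan, _ = Counter(proposals).most_common(1)[0]
    return plan, max_rounds, False


# Stub planners (assumptions): one never changes, one adopts the first
# planner's previous proposal once it has seen it.
def stubborn(task, previous):
    return "check_totals_first"


def follower(task, previous):
    return previous[0] if previous[0] is not None else "check_vendor_first"


plan, rounds_used, agreed = debate_until_consensus([stubborn, follower], "invoice")
capped = debate_until_consensus([stubborn, follower], "invoice", max_rounds=1)
```

Lowering `max_rounds` bounds the number of planner calls per evaluation, which is the cost lever the cap is meant to provide.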
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Continuous numeric scores show variance; authors recommend discrete/categorical scoring for repeatability.
Higher cost and latency from multiple LLM calls; needs caching and budget-aware planning for production.
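One way to act on the discrete-scoring recommendation is to bucket continuous scores into a small set of labels, so that small run-to-run variations map to the same category. The labels and cut points below are assumptions chosen for illustration; the source does not specify them.

```python
import bisect

# Hypothetical labels and cut points (not from the paper).
LABELS = ["fail", "partial", "pass"]
THRESHOLDS = [0.5, 0.8]  # score < 0.5 is fail, 0.5 to < 0.8 is partial, >= 0.8 is pass


def to_category(score):
    """Map a continuous score in [0, 1] to a discrete label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return LABELS[bisect.bisect_right(THRESHOLDS, score)]


# Three repeated runs with slightly different continuous scores
# collapse to a single category.
repeated_runs = [0.82, 0.86, 0.91]
categories = [to_category(s) for s in repeated_runs]
```

Scores that land near a cut point can still flip between categories across runs, so thresholds are best placed away from where scores cluster.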
When Not To Use
Real-time, low-latency pipelines where extra evaluation calls would block operations.
Very small or cost-sensitive deployments where multiple LLM calls are unaffordable.
Failure Modes
Variance in continuous-valued judgments leading to inconsistent numeric scores.
High runtime cost or latency if evaluation is not budgeted or cached.
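The cost failure mode can be partly mitigated by caching evaluator results for identical inputs. The sketch below uses `functools.lru_cache` from the Python standard library; `call_llm_judge` is a hypothetical stand-in for a real LLM call, and the approach assumes the evaluator is treated as deterministic for a given prompt and step output.

```python
from functools import lru_cache

call_count = 0  # counts how often the (stubbed) LLM is actually invoked


def call_llm_judge(prompt, step_output):
    """Hypothetical stand-in for an LLM evaluator call."""
    global call_count
    call_count += 1
    return 1.0 if "total" in step_output else 0.0


@lru_cache(maxsize=1024)
def cached_judge(prompt, step_output):
    """Return a cached score when the same (prompt, step_output) repeats."""
    return call_llm_judge(prompt, step_output)


# The second identical call is served from the cache, not the LLM.
first = cached_judge("Check the invoice total.", "total matches line items")
second = cached_judge("Check the invoice total.", "total matches line items")
```

Caching reuses one sampled judgment, so it should be turned off for runs whose purpose is to measure score variance.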

