Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
AEMA gives more stable, explainable, and human-aligned evaluations for multi-step agent workflows. That yields predictable decision thresholds, fewer manual overrides, and auditable records for compliance.
Summary TLDR
AEMA is a four-role, multi-agent evaluation system that plans, refines prompts, runs multiple evaluators, and produces auditable reports. In a finance invoice workflow, AEMA produces more stable scores and a lower average error relative to human judges than a single LLM-as-a-judge, at the cost of extra LLM calls and added latency.
Problem Statement
Current LLM evaluation methods score single responses, so they miss the multi-step process reliability, transparency, and auditability that enterprise multi-agent systems need.
Main Contribution
AEMA: a four-role, verifiable evaluation loop (Planning, Prompt-Refinement, Evaluation, Final Report) that records traceable decisions for audit and oversight.
A hybrid retrieval + synthesis approach to prepare few-shot exemplars and function parameters automatically for step-level evaluators.
Empirical demo in a finance invoice workflow showing reduced score dispersion and smaller average error versus a single LLM judge.
Key Findings
The planning debate converged within a few rounds across repeated runs.
On good-quality invoices, AEMA's scores are closer to human judges' scores on average across steps.
AEMA maintains lower average error than the single LLM under degraded inputs.
Results
Planning consensus rounds
Average absolute error to human (6 steps)
Average absolute error to human (6 steps)
Final score proximity to human (good-quality)
Who Should Care
What To Try In 7 Days
Run AEMA on one existing multi-step workflow (e.g., invoice validation) to compare stability vs your current judge.
Enable the Planning debate and cap at three rounds to balance stability and cost.
Turn on trace logging (plan → prompts → evaluations → report) to create an audit trail for one critical workflow.
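One way to picture the audit trail is as an append-only log with one record per stage. The sketch below is a minimal illustration, not AEMA's actual logging format: the `AuditTrail` and `TraceRecord` names, the JSON Lines serialization, and the payload fields are all assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Stage names mirror the plan -> prompts -> evaluations -> report order in the text.
STAGES = ("plan", "prompts", "evaluations", "report")


@dataclass
class TraceRecord:
    """One logged decision (hypothetical record shape)."""
    run_id: str
    stage: str
    payload: Dict[str, Any]


@dataclass
class AuditTrail:
    """Append-only trace for a single evaluation run (hypothetical helper)."""
    run_id: str
    records: List[TraceRecord] = field(default_factory=list)

    def log(self, stage: str, payload: Dict[str, Any]) -> None:
        # Reject stages outside the four known ones so typos do not go unnoticed.
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.records.append(TraceRecord(self.run_id, stage, payload))

    def is_complete(self) -> bool:
        # A run is auditable end to end once every stage has at least one record.
        return {r.stage for r in self.records} == set(STAGES)

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order the records were logged.
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.records)


trail = AuditTrail("run-001")
trail.log("plan", {"steps": ["schema_check", "total_check"]})
trail.log("prompts", {"schema_check": "Check that every required field is present."})
trail.log("evaluations", {"schema_check": 1.0, "total_check": 0.5})
trail.log("report", {"final_score": 0.75})
print(trail.to_jsonl())
```

Writing the serialized lines to durable storage after each run would give reviewers a record they can replay stage by stage.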
Agent Features
Memory
- Historical evaluation records stored in ChromaDB (vector store)
- Retrieval of past exemplars for few-shot prompting
Planning
- Plan generation using LLM with tool filtering
- Debate-style plan review (iterative improvement)
Tool Use
- Function filter using hybrid sparse+dense retrieval
- Evaluation functions: deterministic checks + LLM judges
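The function filter can be sketched as a score fusion over a catalogue of evaluation functions. The summary does not say how the sparse and dense signals are combined, so the weighted sum below is an assumption, as are the function names, the `alpha` weight, and the letter-frequency `toy_embed` stand-in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence


def sparse_score(query: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document (simple term overlap)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    overlap = sum(min(q[t], d[t]) for t in q)
    return overlap / max(1, sum(q.values()))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: letter-frequency vector. A real system would call a model."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def filter_functions(
    query: str,
    functions: Dict[str, str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
    k: int = 2,
    alpha: float = 0.5,
) -> List[str]:
    """Return the top-k function names by alpha * sparse + (1 - alpha) * dense."""
    qv = embed(query)
    scored = []
    for name, description in functions.items():
        score = alpha * sparse_score(query, description) + (1 - alpha) * cosine(qv, embed(description))
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


catalogue = {
    "check_total": "validate invoice total amount against line items",
    "check_date": "verify due date format",
    "check_vendor": "match vendor name to registry",
}
print(filter_functions("validate invoice total amount", catalogue, k=2))
```

Limiting the planner to the top-k candidates keeps the plan-generation prompt short, which is the cost benefit the hybrid retrieval is meant to provide.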
Frameworks
- AutoGen
- LangChain
- ChromaDB
Is Agentic
true
Architectures
- four-role evaluator loop (Planning, Prompt-Refinement, Evaluation, Final Report)
- debate loop: Plan Generator + Plan Evaluator
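The debate loop between the two planner roles can be sketched as a bounded generate-and-review cycle. This is a minimal sketch assuming both roles are plain callables; in AEMA they are LLM-backed, and the `run_debate` signature, the feedback strings, and the toy stubs are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (task, feedback from the previous round) -> ordered plan steps
PlanGenerator = Callable[[str, str], List[str]]
# plan -> (approved?, feedback for the next round)
PlanEvaluator = Callable[[List[str]], Tuple[bool, str]]


def run_debate(
    task: str,
    generate: PlanGenerator,
    review: PlanEvaluator,
    max_rounds: int = 3,
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Alternate generation and review until approval or the round cap is hit."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    plan: List[str] = []
    history: List[Dict[str, object]] = []
    for round_no in range(1, max_rounds + 1):
        plan = generate(task, feedback)
        approved, feedback = review(plan)
        # Keep every round so the final plan can be traced back to the critique behind it.
        history.append(
            {"round": round_no, "plan": list(plan), "approved": approved, "feedback": feedback}
        )
        if approved:
            break
    return plan, history


def toy_generate(task: str, feedback: str) -> List[str]:
    """Stub generator: adds a vendor check only after the reviewer asks for one."""
    steps = ["schema_check", "total_check"]
    if "vendor" in feedback:
        steps.append("vendor_check")
    return steps


def toy_review(plan: List[str]) -> Tuple[bool, str]:
    """Stub reviewer: approves only plans that include a vendor check."""
    if "vendor_check" not in plan:
        return False, "add a vendor check"
    return True, ""


final_plan, rounds = run_debate("evaluate invoice workflow", toy_generate, toy_review)
print(final_plan, len(rounds))
```

If the cap is reached without approval, the function returns the last plan and the history shows that no round was approved, so the caller can decide whether to escalate to a human.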
Collaboration
- Multi-evaluator consensus aggregation
- Iterative plan review between two planner roles
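Multi-evaluator consensus can be illustrated with a simple robust aggregate. The summary does not state which aggregation rule AEMA uses, so the median below is an assumed stand-in, and the `max_spread` threshold for flagging disagreement is a hypothetical parameter.

```python
from statistics import median
from typing import Dict


def aggregate_step(scores_by_evaluator: Dict[str, float], max_spread: float = 0.2) -> Dict[str, object]:
    """Combine several evaluators' scores for one step into a consensus record."""
    if not scores_by_evaluator:
        raise ValueError("at least one evaluator score is required")
    values = list(scores_by_evaluator.values())
    spread = max(values) - min(values)
    return {
        # The median is less sensitive to a single outlying evaluator than the mean.
        "score": median(values),
        "spread": spread,
        # Large disagreement is flagged instead of being silently averaged away.
        "needs_review": spread > max_spread,
    }


print(aggregate_step({"judge_a": 0.8, "judge_b": 0.9, "judge_c": 0.85}))
print(aggregate_step({"judge_a": 0.2, "judge_b": 0.9, "judge_c": 0.85}))
```

Recording the spread alongside the consensus score keeps the disagreement visible in the final report.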
Optimization Features
Token Efficiency
- Prompt merging (trade-off noted between fewer calls and lower interpretability)
Infra Optimization
- Use ChromaDB for fast vector retrieval
- Hybrid retrieval to limit function candidates (top-k selection)
System Optimization
- Skip redundant re-evaluations within a time window
- Evaluate only priority actions to reduce cost
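Skipping redundant re-evaluations can be sketched as a time-windowed cache keyed by the step's content. The class name, the SHA-256 keying, and the injectable clock are assumptions chosen for this illustration; the summary only says that re-evaluations inside a time window are skipped.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class EvaluationCache:
    """Reuse a step's score if the same payload was evaluated within the window."""

    def __init__(self, window_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._window = window_seconds
        self._clock = clock
        # key -> (time of evaluation, score)
        self._entries: Dict[str, Tuple[float, float]] = {}

    def get_or_evaluate(self, step_payload: str, evaluate: Callable[[str], float]) -> float:
        key = hashlib.sha256(step_payload.encode("utf-8")).hexdigest()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._window:
            return hit[1]  # still fresh: skip the (expensive) evaluator call
        score = evaluate(step_payload)
        self._entries[key] = (now, score)
        return score


calls = []

def slow_judge(payload: str) -> float:
    """Stand-in for an LLM judge call; records each invocation."""
    calls.append(payload)
    return 0.9

cache = EvaluationCache(window_seconds=60.0)
cache.get_or_evaluate("invoice-123:total_check", slow_judge)
cache.get_or_evaluate("invoice-123:total_check", slow_judge)
print(len(calls))  # the second lookup is served from the cache
```

Any change to the payload produces a different key, so edited steps are always re-evaluated even inside the window.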
Inference Optimization
- Budget-aware planning (suggested): choose small vs large LMs dynamically
- KV-cache reuse and prompt caching to reduce repeated token cost
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Continuous numeric scores show variance; authors recommend discrete/categorical scoring for repeatability.
- Higher cost and latency from multiple LLM calls; needs caching and budget-aware planning for production.
- Experiments are confined to a single finance workflow with LLM-based agents; cross-domain generalization is not yet shown.
When Not To Use
- Real-time, low-latency pipelines where extra evaluation calls would block operations.
- Very small or cost-sensitive deployments where multiple LLM calls are unaffordable.
- Use cases that only need single-turn scoring with no multi-step reasoning.
Failure Modes
- Variance in continuous-valued judgments leading to inconsistent numeric scores.
- High runtime cost or latency if evaluation is not budgeted or cached.
- Plan generation may miss required evaluators if the function database is incomplete.
Core Entities
Models
- GPT-4o
Metrics
- Schema validity (fraction of valid fields)
- Agent selection F1
- Step-agent coherence
- Order preservation (Kendall-style inversion)
- Step efficiency (parsimony)
- Aggregated AHP-weighted final score
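Three of these metrics are easy to illustrate with their common textbook formulations. The summary does not give the paper's exact definitions, so the functions below are assumptions: F1 over agent sets, a normalized pairwise-inversion count for order preservation, and a weighted average that takes the AHP-derived weights as given (the pairwise-comparison step that produces them is not shown).

```python
from itertools import combinations
from typing import Dict, Sequence, Set


def agent_selection_f1(expected: Set[str], selected: Set[str]) -> float:
    """Harmonic mean of precision and recall over the chosen agents."""
    true_positives = len(expected & selected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def order_preservation(expected: Sequence[str], observed: Sequence[str]) -> float:
    """1.0 when observed steps follow the expected order, 0.0 when fully reversed."""
    rank = {step: i for i, step in enumerate(expected)}
    # Only steps that appear in the expected sequence can be compared.
    ranks = [rank[step] for step in observed if step in rank]
    n = len(ranks)
    if n < 2:
        return 1.0
    inversions = sum(1 for i, j in combinations(range(n), 2) if ranks[i] > ranks[j])
    return 1.0 - inversions / (n * (n - 1) / 2)


def weighted_final_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of metric scores; weights are assumed to come from AHP."""
    total_weight = sum(weights[metric] for metric in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[metric] * weights[metric] for metric in scores) / total_weight


print(agent_selection_f1({"ocr", "validator", "approver"}, {"ocr", "validator", "auditor"}))
print(order_preservation(["a", "b", "c", "d"], ["a", "c", "b", "d"]))
print(weighted_final_score({"f1": 1.0, "order": 0.5}, {"f1": 0.75, "order": 0.25}))
```

Dividing by the total weight means the weights need not sum to one before being passed in.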
Datasets
- Finance invoice workflow (good-quality images)
- Finance invoice workflow (degraded/blurred images)
Context Entities
Models
- GPT-4o (used for Plan Generator and Plan Evaluator)
Metrics
- Mean absolute error to human per step
- Score dispersion statistics (IQR, whiskers)
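Both comparison metrics can be computed with the Python standard library. In the sketch below the function names are hypothetical and every number is a placeholder chosen to exercise the code, not a result from the experiments.

```python
from statistics import mean, quantiles
from typing import Sequence


def mean_absolute_error(judge: Sequence[float], human: Sequence[float]) -> float:
    """Average absolute gap between a judge's per-step scores and the human reference."""
    if len(judge) != len(human) or not judge:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(j - h) for j, h in zip(judge, human))


def interquartile_range(scores: Sequence[float]) -> float:
    """Spread of the middle half of repeated-run scores (needs at least two values)."""
    q1, _, q3 = quantiles(scores, n=4)
    return q3 - q1


# Placeholder per-step scores for one invoice (illustrative values only).
human_scores = [0.8, 0.6, 0.9]
judge_scores = [0.7, 0.6, 1.0]
print(mean_absolute_error(judge_scores, human_scores))

# Placeholder final scores from repeated runs of two hypothetical judges.
stable_runs = [0.80, 0.81, 0.82, 0.83]
noisy_runs = [0.50, 0.70, 0.90, 1.00]
print(interquartile_range(stable_runs) < interquartile_range(noisy_runs))
```

A lower mean absolute error indicates closer agreement with human judges, and a smaller interquartile range indicates more repeatable scoring across runs.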
Datasets
- Internal evaluation records / exemplars (retrieved via ChromaDB)
Benchmarks
- Human-as-a-judge reference scores (used as ground truth in experiments)

