A process-aware, auditable multi-agent evaluator that produces more stable, human-aligned scores than a single LLM judge

January 17, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The prototype was validated on a single finance workflow with 20–30 repeated runs per experiment. Results show improved stability and closer human alignment, at the price of more LLM calls; cross-domain testing is limited.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan

Links

Abstract / PDF

Why It Matters For Business

AEMA gives more stable, explainable, and human-aligned evaluations for multi-step agent workflows. That yields predictable decision thresholds, fewer manual overrides, and auditable records for compliance.

Who Should Care

Summary TLDR

AEMA is a four-role, multi-agent evaluation system that plans, refines prompts, runs multiple evaluators, and produces auditable reports. On finance invoice workflows, AEMA gives more stable scores and a lower average error relative to human judges than a single LLM-as-a-judge, at the cost of extra LLM calls and latency.

Problem Statement

Current LLM evaluation methods score single responses and miss the multi-step process reliability, transparency, and auditability that enterprise multi-agent systems need.

Main Contribution

AEMA: a four-role, verifiable evaluation loop (Planning, Prompt-Refinement, Evaluation, Final Report) that records traceable decisions for audit and oversight.

A hybrid retrieval + synthesis approach to prepare few-shot exemplars and function parameters automatically for step-level evaluators.

Key Findings

Debate in planning converged within a few rounds for repeated runs.

Numbers: Over 30 runs, 13 converged after 1 round, 13 after 2, and 4 after 3

Practical Use: Cap the planning debate at three rounds to get stable plans while limiting extra LLM calls.

Evidence Ref: Sec 4.3 Planning Experiment
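As a quick arithmetic check on the counts reported above, the distribution implies an average of 1.7 debate rounds per run, with every run converging by round three. A minimal Python sketch (the variable names are illustrative and not taken from the paper):

```python
# Convergence counts reported for 30 planning runs: rounds -> number of runs.
rounds_to_runs = {1: 13, 2: 13, 3: 4}

total_runs = sum(rounds_to_runs.values())

# Weighted mean of rounds needed: (13*1 + 13*2 + 4*3) / 30.
mean_rounds = sum(r * n for r, n in rounds_to_runs.items()) / total_runs

# Share of runs that would also have converged under a two-round cap.
share_within_two = (rounds_to_runs[1] + rounds_to_runs[2]) / total_runs

print(total_runs)                  # 30
print(round(mean_rounds, 2))       # 1.7
print(round(share_within_two, 3))  # 0.867
```

A three-round cap covers all 30 observed runs; a two-round cap would have cut off the 4 runs that needed a third round.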

On good-quality invoices, AEMA's step scores are closer to human judges' scores on average.

Numbers: Mean absolute error over 6 steps: AEMA=0.018 vs single LLM=0.077 (good-quality)

Practical Use: Use AEMA when you need evaluation scores that match human judgments more closely for business decisions.

Evidence Ref: Table 1; Sec 4.3 Human Alignments Experiment
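For readers who want to reproduce this kind of comparison on their own workflow, the metric above is a plain mean absolute error between a judge's per-step scores and the human reference. The sketch below assumes scores on a common scale; the per-step values are made up for illustration and are not the paper's data.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over aligned per-step scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Hypothetical per-step scores for a 6-step workflow (NOT the paper's data).
human = [0.90, 0.80, 1.00, 0.70, 0.90, 0.80]
judge = [0.90, 0.85, 1.00, 0.70, 0.80, 0.80]

mae = mean_absolute_error(judge, human)

# Gap between the two reported averages: single LLM (0.077) minus AEMA (0.018).
reported_delta = round(0.077 - 0.018, 3)

print(round(mae, 3))
print(reported_delta)
```

The second value matches the 0.059 difference implied by the reported figures.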

Results

Metric: Planning consensus rounds
Value: 13/30 after 1 round; 13/30 after 2; 4/30 after 3
Baseline: none reported
Delta: none reported
Split / Dataset: Planning stability (30 runs)
Evidence: Debate loop capped at five rounds; consensus typically by 1–3 rounds
Evidence Ref: Sec 4.3, Appendix B

Metric: Average absolute error to human (6 steps)
Value: AEMA=0.018
Baseline: Single LLM=0.077
Delta: AEMA lower by 0.059
Split / Dataset: Good-quality invoices (20 runs)
Evidence: Mean absolute error across step scores
Evidence Ref: Table 1

What To Try In 7 Days

Run AEMA on one existing multi-step workflow (e.g., invoice validation) to compare stability vs your current judge.

Enable the Planning debate and cap at three rounds to balance stability and cost.

Turn on trace logging (plan → prompts → evaluations → report) to create an audit trail for one critical workflow.
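The last two steps above (a capped planning debate plus trace logging) can be sketched together. This is a minimal illustration under stated assumptions, not AEMA's implementation: `run_debate`, `Trace`, `TraceEvent`, and the two stub callables are hypothetical names, and in a real setup the generator and reviewer would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class TraceEvent:
    stage: str   # e.g. "plan", "prompts", "evaluations", "report"
    detail: str

@dataclass
class Trace:
    events: List[TraceEvent] = field(default_factory=list)

    def log(self, stage: str, detail: str) -> None:
        self.events.append(TraceEvent(stage, detail))

def run_debate(generate: Callable[[str], str],
               review: Callable[[str], Tuple[bool, str]],
               trace: Trace,
               max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate generator and reviewer until acceptance or the round cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback, plan = "", ""
    for round_no in range(1, max_rounds + 1):
        plan = generate(feedback)
        accepted, feedback = review(plan)
        trace.log("plan", f"round={round_no} accepted={accepted} plan={plan}")
        if accepted:
            return plan, round_no
    return plan, max_rounds  # cap reached: keep the latest plan

# Hypothetical stand-ins for the two planner roles.
def stub_generate(feedback: str) -> str:
    return "plan-v2" if feedback else "plan-v1"

def stub_review(plan: str) -> Tuple[bool, str]:
    return (True, "") if plan == "plan-v2" else (False, "add a validation step")

trace = Trace()
final_plan, rounds_used = run_debate(stub_generate, stub_review, trace)
print(final_plan, rounds_used, len(trace.events))
```

Logging one event per round keeps the audit trail aligned with the debate itself; the same `Trace` object could then collect the prompt, evaluation, and report stages.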

Agent Features

Memory
Historical evaluation records stored in ChromaDB (vector store); retrieval of past exemplars for few-shot prompting
Planning
Plan generation using LLM with tool filtering; debate-style plan review (iterative improvement)
Tool Use
Function filter using hybrid sparse+dense retrieval; evaluation functions: deterministic checks + LLM judges
Frameworks
AutoGen; LangChain; ChromaDB
Is Agentic

Yes

Architectures
Four-role evaluator loop (Planning, Prompt-Refinement, Evaluation, Final Report); debate loop: Plan Generator + Plan Evaluator
Collaboration
Multi-evaluator consensus aggregation; iterative plan review between two planner roles
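The function filter listed under Tool Use combines sparse and dense retrieval to shortlist evaluation functions. The summary does not say how the two signals are fused, so the sketch below makes loud assumptions: word-overlap (Jaccard) as the sparse score, cosine similarity over toy vectors as the dense score, and a fixed weight `alpha` to blend them. All names and data are hypothetical.

```python
import math

def sparse_score(query: str, text: str) -> float:
    """Jaccard overlap of lowercase word sets (stand-in for a sparse retriever)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0

def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (stand-in for embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_top_k(query, query_vec, candidates, k=2, alpha=0.5):
    """Blend sparse and dense scores, then keep the k best function names."""
    scored = []
    for c in candidates:
        score = (alpha * sparse_score(query, c["description"])
                 + (1 - alpha) * cosine(query_vec, c["vector"]))
        scored.append((score, c["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical evaluation functions with toy 2-d "embeddings".
candidates = [
    {"name": "check_schema", "description": "validate invoice schema fields", "vector": [1.0, 0.0]},
    {"name": "check_total", "description": "verify invoice total amount", "vector": [0.8, 0.6]},
    {"name": "summarize_thread", "description": "summarize email thread", "vector": [0.0, 1.0]},
]

shortlist = hybrid_top_k("validate invoice schema", [1.0, 0.0], candidates, k=2)
print(shortlist)
```

In practice the dense vectors would come from an embedding model and a vector store such as ChromaDB; the blending step is the part this sketch is meant to show.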

Optimization Features

Token Efficiency
Prompt merging (trade-off noted between fewer calls and lower interpretability)
Infra Optimization
Use ChromaDB for fast vector retrieval; hybrid retrieval to limit function candidates (top-k selection)
System Optimization
Skip redundant re-evaluations within a time window; evaluate only priority actions to reduce cost
Inference Optimization
Budget-aware planning (suggested): choose small vs large LMs dynamically; KV-cache reuse and prompt caching to reduce repeated token cost
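One of the system optimizations above, skipping redundant re-evaluations within a time window, amounts to a small time-bounded cache in front of the evaluator. The sketch below is one possible shape, assuming evaluations can be keyed by a string; `EvaluationCache` and its parameters are hypothetical and not part of AEMA.

```python
import time

class EvaluationCache:
    """Reuse a stored score if the same key was evaluated within the window."""

    def __init__(self, window_seconds: float, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock       # injectable so the behavior can be tested
        self._store = {}          # key -> (timestamp, score)

    def get_or_evaluate(self, key: str, evaluate):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._window:
            return hit[1]         # still fresh: skip the expensive call
        score = evaluate()        # e.g. an LLM-judge call in a real system
        self._store[key] = (now, score)
        return score

# Usage with a fake clock so no real waiting is needed.
fake_now = [0.0]
cache = EvaluationCache(window_seconds=60, clock=lambda: fake_now[0])
calls = []

def expensive_evaluation():
    calls.append("called")
    return 0.9

cache.get_or_evaluate("invoice-42:step-3", expensive_evaluation)
fake_now[0] = 30.0
cache.get_or_evaluate("invoice-42:step-3", expensive_evaluation)  # served from cache
fake_now[0] = 61.0
cache.get_or_evaluate("invoice-42:step-3", expensive_evaluation)  # window expired
print(len(calls))  # 2
```

The window length is a cost versus freshness trade-off: a longer window saves more calls but may reuse a score after the underlying action has changed.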

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Continuous numeric scores show variance; authors recommend discrete/categorical scoring for repeatability.

Higher cost and latency from multiple LLM calls; needs caching and budget-aware planning for production.
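The first limitation suggests moving from continuous to discrete scores for repeatability. A minimal sketch of that idea, assuming scores in [0, 1]: the thresholds and labels below are hypothetical choices for illustration, not values from the paper.

```python
from bisect import bisect_right

# Hypothetical cut points: below 0.5 is "fail", 0.5 up to 0.8 is "partial",
# 0.8 and above is "pass".
THRESHOLDS = [0.5, 0.8]
LABELS = ["fail", "partial", "pass"]

def to_category(score: float) -> str:
    """Map a continuous score in [0, 1] to a discrete label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")
    return LABELS[bisect_right(THRESHOLDS, score)]

# Two repeated runs that differ numerically land in the same category.
print(to_category(0.83), to_category(0.87))  # pass pass
print(to_category(0.49), to_category(0.50))  # fail partial
```

Binning absorbs small run-to-run noise, but scores near a threshold can still flip category, so cut points are best placed away from where scores cluster.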

When Not To Use

Real-time, low-latency pipelines where extra evaluation calls would block operations.

Very small or cost-sensitive deployments where multiple LLM calls are unaffordable.

Failure Modes

Variance in continuous-valued judgments leading to inconsistent numeric scores.

High runtime cost or latency if evaluation is not budgeted or cached.

Core Entities

Models

GPT-4o

Metrics

Schema validity (fraction of valid fields); agent selection F1; step-agent coherence; order preservation (Kendall-style inversion); step efficiency (parsimony); aggregated AHP-weighted final score
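Two of the metrics above lend themselves to a short illustration: an inversion-based order-preservation score and a weighted aggregate. The exact formulas used in the paper are not given in this summary, so the sketch assumes a common formulation: one minus the fraction of step pairs that appear in the wrong relative order, and a normalized weighted sum. The weights are taken as given here, whereas AHP would derive them from pairwise comparisons; all names and values are hypothetical.

```python
def order_preservation(expected, actual) -> float:
    """1 - (inverted pairs / total pairs), over steps present in both lists."""
    rank = {step: i for i, step in enumerate(expected)}
    ranks = [rank[s] for s in actual if s in rank]
    n = len(ranks)
    if n < 2:
        return 1.0  # nothing to invert
    inversions = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if ranks[i] > ranks[j]
    )
    return 1.0 - inversions / (n * (n - 1) / 2)

def weighted_score(scores: dict, weights: dict) -> float:
    """Normalized weighted sum of per-metric scores."""
    total_weight = sum(weights[m] for m in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[m] * weights[m] for m in scores) / total_weight

expected = ["extract", "validate", "approve", "pay"]
actual = ["extract", "approve", "validate", "pay"]  # one swapped pair
order = order_preservation(expected, actual)

# Hypothetical metric scores and weights.
final = weighted_score(
    {"schema_validity": 1.0, "order_preservation": order},
    {"schema_validity": 1.0, "order_preservation": 1.0},
)
print(round(order, 3), round(final, 3))
```

With four steps there are six pairs, so one swapped pair gives an order score of 5/6.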

Datasets

Finance invoice workflow (good-quality images); finance invoice workflow (degraded/blurred images)

Context Entities

Models

GPT-4o (used for Plan Generator and Plan Evaluator)

Metrics

Mean absolute error to human per step; score dispersion statistics (IQR, whiskers)

Datasets

Internal evaluation records / exemplars (retrieved via ChromaDB)

Benchmarks

Human-as-a-judge reference scores (used as ground truth in experiments)