A process-aware, auditable multi-agent evaluator that produces more stable, human-aligned scores than a single LLM judge

January 17, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

YenTing Lee, Keerthi Koneru, Zahra Moslemi, Sheethal Kumar, Ramesh Radhakrishnan


Why It Matters For Business

AEMA gives more stable, explainable, and human-aligned evaluations for multi-step agent workflows. That yields predictable decision thresholds, fewer manual overrides, and auditable records for compliance.

Summary TLDR

AEMA is a four-role, multi-agent evaluation system that plans, refines prompts, runs multiple evaluators, and produces auditable reports. In a finance invoice workflow, AEMA produces more stable scores and a lower average error relative to human judges than a single LLM-as-a-judge, at the cost of extra LLM calls and added latency.

Problem Statement

Current LLM evaluation methods score single responses and miss the multi-step process reliability, transparency, and auditability that enterprise multi-agent systems need.

Main Contribution

AEMA: a four-role, verifiable evaluation loop (Planning, Prompt-Refinement, Evaluation, Final Report) that records traceable decisions for audit and oversight.

A hybrid retrieval + synthesis approach to prepare few-shot exemplars and function parameters automatically for step-level evaluators.

Empirical demo in a finance invoice workflow showing reduced score dispersion and smaller average error versus a single LLM judge.

Key Findings

Across 30 repeated runs, the planning debate always converged within three rounds.

Numbers: 13 of 30 runs converged after 1 round, 13 after 2, and 4 after 3

AEMA is closer to human judges on average across steps for good-quality invoices.

Numbers: Mean absolute error over 6 steps (good-quality): AEMA = 0.018 vs. single LLM = 0.077

AEMA maintains lower average error than the single LLM under degraded inputs.

Numbers: Mean absolute error (blurred): AEMA = 0.037 vs. single LLM = 0.108
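To put the two error comparisons on a common scale, the short sketch below computes the relative error reduction from the figures reported above. Only the four MAE values come from the source; the function name and the dictionary layout are illustrative assumptions.

```python
# Mean absolute errors to human judges, as reported in the findings above.
reported_mae = {
    "good-quality": {"aema": 0.018, "single_llm": 0.077},
    "blurred": {"aema": 0.037, "single_llm": 0.108},
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` lowers the error of `baseline`."""
    return (baseline - candidate) / baseline


reductions = {
    condition: relative_reduction(v["single_llm"], v["aema"])
    for condition, v in reported_mae.items()
}

for condition, reduction in reductions.items():
    print(f"{condition}: AEMA error is {reduction:.1%} lower than the single LLM")
```

On these numbers the reduction is roughly 77% for good-quality invoices and roughly 66% for blurred ones, so the advantage narrows somewhat under degraded inputs but does not disappear.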

Results

Planning consensus rounds

Value: 13/30 after 1 round; 13/30 after 2; 4/30 after 3

Average absolute error to human (6 steps, good-quality)

Value: AEMA = 0.018

Baseline: Single LLM = 0.077

Average absolute error to human (6 steps, blurred)

Value: AEMA = 0.037

Baseline: Single LLM = 0.108

Final score proximity to human (good-quality)

Value: AEMA final mean = 0.97 (human = 0.96)

Baseline: Single LLM final mean = 0.87
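The planning consensus counts reported above can be summarized as an average number of debate rounds, which is a useful input when budgeting LLM calls. The sketch below uses only the reported counts; the variable names are illustrative.

```python
# Reported convergence counts: debate rounds -> number of runs (out of 30).
rounds_to_runs = {1: 13, 2: 13, 3: 4}

total_runs = sum(rounds_to_runs.values())
mean_rounds = sum(r * n for r, n in rounds_to_runs.items()) / total_runs
share_within_two = (rounds_to_runs[1] + rounds_to_runs[2]) / total_runs

print(f"runs={total_runs}, mean rounds={mean_rounds:.2f}, "
      f"converged within two rounds={share_within_two:.0%}")
```

The mean works out to 1.7 rounds per run, with about 87% of runs finishing within two rounds.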

Who Should Care

What To Try In 7 Days

Run AEMA on one existing multi-step workflow (e.g., invoice validation) to compare stability vs your current judge.

Enable the Planning debate and cap at three rounds to balance stability and cost.

Turn on trace logging (plan → prompts → evaluations → report) to create an audit trail for one critical workflow.
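One way to picture the plan → prompts → evaluations → report trail is an append-only log with one record per stage. The sketch below is a minimal illustration using only the Python standard library; `TraceEvent`, `TraceLog`, the stage names, and the payload fields are hypothetical and are not taken from AEMA's implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Hypothetical stage names mirroring the four stages named in the text.
STAGES = ("plan", "prompts", "evaluations", "report")


@dataclass
class TraceEvent:
    run_id: str
    stage: str
    payload: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class TraceLog:
    """Append-only record of what each stage produced for one run."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: list[TraceEvent] = []

    def record(self, stage: str, payload: dict) -> TraceEvent:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        event = TraceEvent(self.run_id, stage, payload)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        # One JSON object per line, convenient for log shipping and review.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)


log = TraceLog("invoice-demo-001")
log.record("plan", {"steps": ["extract_fields", "validate_totals"]})
log.record("prompts", {"validate_totals": "Check that line items sum to the total."})
log.record("evaluations", {"validate_totals": 1.0})
log.record("report", {"final_score": 1.0})
print(log.to_jsonl())
```

Writing the log as JSON Lines keeps each stage independently inspectable, which is the property an auditor is likely to care about most.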

Agent Features

Memory

  • Historical evaluation records stored in ChromaDB (vector store)
  • Retrieval of past exemplars for few-shot prompting

Planning

  • Plan generation using LLM with tool filtering
  • Debate-style plan review (iterative improvement)

Tool Use

  • Function filter using hybrid sparse+dense retrieval
  • Evaluation functions: deterministic checks + LLM judges
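The function filter listed above combines sparse (lexical) and dense (embedding) signals. The source does not say how the two are fused, so the sketch below assumes a simple weighted sum of Jaccard token overlap and cosine similarity. The function catalog, the three-dimensional vectors, and the `alpha` weight are all hypothetical stand-ins; a real system would obtain embeddings from a model.

```python
import math

# Hypothetical catalog: name -> (description, toy embedding vector).
FUNCTIONS = {
    "check_schema_validity": (
        "validate invoice fields against schema", [0.9, 0.1, 0.0]),
    "check_total_matches_line_items": (
        "verify invoice total equals sum of line items", [0.2, 0.9, 0.1]),
    "judge_summary_quality": (
        "LLM judge rates the written summary", [0.0, 0.1, 0.9]),
}


def sparse_score(query: str, doc: str) -> float:
    """Jaccard overlap between lowercase token sets (a crude lexical signal)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def dense_score(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_functions(query: str, query_vec: list[float],
                     k: int = 2, alpha: float = 0.5) -> list[str]:
    """Return the top-k function names by the weighted hybrid score."""
    scored = [
        (alpha * sparse_score(query, desc)
         + (1 - alpha) * dense_score(query_vec, vec), name)
        for name, (desc, vec) in FUNCTIONS.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


top = filter_functions("verify invoice total", [0.1, 0.95, 0.05])
print(top)
```

Limiting the planner to the top-k candidates keeps the plan prompt short, at the risk of dropping a relevant evaluator when k is too small.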

Frameworks

  • AutoGen
  • LangChain
  • ChromaDB

Is Agentic

true

Architectures

  • four-role evaluator loop (Planning, Prompt-Refinement, Evaluation, Final Report)
  • debate loop: Plan Generator + Plan Evaluator
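The debate loop between the Plan Generator and Plan Evaluator can be sketched as a bounded iteration that stops as soon as the evaluator accepts a plan. In the sketch below, both roles are stubbed with plain functions in place of LLM calls; `debate`, the stub behavior, and the three-round cap are illustrative assumptions rather than AEMA's code.

```python
from typing import Callable, Optional, Tuple


def debate(
    generate: Callable[[Optional[str]], str],
    review: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, int, bool]:
    """Alternate generation and review until acceptance or the round cap."""
    feedback: Optional[str] = None
    plan = ""
    for round_no in range(1, max_rounds + 1):
        plan = generate(feedback)            # generator sees the last critique
        accepted, feedback = review(plan)    # evaluator accepts or critiques
        if accepted:
            return plan, round_no, True
    return plan, max_rounds, False           # cap reached without consensus


# Stub roles standing in for LLM-backed agents (hypothetical behavior).
def stub_generator(feedback: Optional[str]) -> str:
    steps = ["extract_fields"]
    if feedback:
        steps.append("validate_totals")
    return " -> ".join(steps)


def stub_evaluator(plan: str) -> Tuple[bool, str]:
    if "validate_totals" in plan:
        return True, ""
    return False, "add a validate_totals step"


plan, rounds, accepted = debate(stub_generator, stub_evaluator)
print(plan, rounds, accepted)
```

Returning the round count alongside the plan makes it straightforward to log how often the cap is hit, which is the signal that the cap or the evaluator's criteria may need tuning.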

Collaboration

  • Multi-evaluator consensus aggregation
  • Iterative plan review between two planner roles
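The source does not specify how multi-evaluator consensus is aggregated. As one plausible illustration, the sketch below takes the median of the evaluator scores, which resists a single outlier, and flags the step for human review when the evaluators disagree widely. The evaluator names, scores, and the `max_spread` threshold are hypothetical.

```python
from statistics import median


def aggregate(scores: dict[str, float], max_spread: float = 0.2) -> dict:
    """Median consensus plus a disagreement flag for human review."""
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),
        "spread": spread,
        "needs_review": spread > max_spread,
    }


# Hypothetical scores from three evaluators for one workflow step.
result = aggregate({"evaluator_a": 0.92, "evaluator_b": 0.95, "evaluator_c": 0.60})
print(result)
```

Here the outlying 0.60 does not drag the consensus down, but the wide spread still surfaces the disagreement in the report.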

Optimization Features

Token Efficiency

  • Prompt merging (trade-off noted between fewer calls and lower interpretability)

Infra Optimization

  • Use ChromaDB for fast vector retrieval
  • Hybrid retrieval to limit function candidates (top-k selection)

System Optimization

  • Skip redundant re-evaluations within a time window
  • Evaluate only priority actions to reduce cost
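Skipping redundant re-evaluations within a time window can be pictured as a small cache keyed on the step and its input. The sketch below is a minimal illustration, not AEMA's mechanism: `EvaluationCache`, the hashing scheme, and the 60-second window are assumptions, and a fake clock is used so the behavior is deterministic.

```python
import hashlib
import json
import time
from typing import Callable


class EvaluationCache:
    """Reuse a score when the same step and input recur within the window."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, float]] = {}

    def _key(self, step: str, payload: dict) -> str:
        raw = json.dumps({"step": step, "payload": payload}, sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def evaluate(self, step: str, payload: dict,
                 evaluator: Callable[[dict], float]) -> float:
        key = self._key(step, payload)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.window:
            return hit[1]                    # fresh enough: skip the call
        score = evaluator(payload)           # the expensive (LLM) evaluation
        self._store[key] = (now, score)
        return score


class FakeClock:
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


calls: list[str] = []


def slow_evaluator(payload: dict) -> float:
    calls.append(payload["invoice_id"])      # count real evaluations
    return 0.9


clock = FakeClock()
cache = EvaluationCache(window_seconds=60, clock=clock)
invoice = {"invoice_id": "INV-1", "total": 120.0}

cache.evaluate("validate_totals", invoice, slow_evaluator)  # evaluated
clock.now = 10.0
cache.evaluate("validate_totals", invoice, slow_evaluator)  # reused
clock.now = 100.0
cache.evaluate("validate_totals", invoice, slow_evaluator)  # window expired
print(len(calls))
```

Three requests lead to two real evaluations. Keying on a hash of the input means any change to the invoice payload forces a fresh evaluation.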

Inference Optimization

  • Budget-aware planning (suggested): choose small vs large LMs dynamically
  • KV-cache reuse and prompt caching to reduce repeated token cost

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Continuous numeric scores show variance; authors recommend discrete/categorical scoring for repeatability.
  • Higher cost and latency from multiple LLM calls; needs caching and budget-aware planning for production.
  • Experiments are confined to a single finance workflow with LLM-based agents; cross-domain generalization has not yet been shown.
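The recommendation to prefer discrete or categorical scoring can be illustrated by bucketing a continuous score into a small label set, so that small run-to-run fluctuations map to the same outcome. The thresholds and labels below are hypothetical, as are the example scores.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; a real deployment would calibrate these.
THRESHOLDS = (0.5, 0.8)
LABELS = ("fail", "needs_review", "pass")


def to_category(score: float) -> str:
    """Map a score in [0, 1] to a categorical label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return LABELS[bisect_right(THRESHOLDS, score)]


# Hypothetical repeated scores for the same step from a continuous judge.
repeated_scores = [0.91, 0.94, 0.88, 0.97]
categories = {to_category(s) for s in repeated_scores}
print(categories)
```

The four numeric scores differ, but they all fall into the same category, which is the repeatability the limitation above is concerned with. Scores near a threshold can still flip labels, so cut points deserve care.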

When Not To Use

  • Real-time, low-latency pipelines where extra evaluation calls would block operations.
  • Very small or cost-sensitive deployments where multiple LLM calls are unaffordable.
  • Use cases that only need single-turn scoring with no multi-step reasoning.

Failure Modes

  • Variance in continuous-valued judgments leading to inconsistent numeric scores.
  • High runtime cost or latency if evaluation is not budgeted or cached.
  • Plan generation may miss required evaluators if the function database is incomplete.

Core Entities

Models

  • GPT-4o

Metrics

  • Schema validity (fraction of valid fields)
  • Agent selection F1
  • Step-agent coherence
  • Order preservation (Kendall-style inversion)
  • Step efficiency (parsimony)
  • Aggregated AHP-weighted final score
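Two of the metrics listed above lend themselves to a short illustration. The source does not give formulas, so the sketch below makes two assumptions: order preservation is taken as one minus the fraction of inverted step pairs, and AHP weights are derived from a pairwise comparison matrix with the row geometric mean approximation. The step names, the comparison matrix, and the per-metric scores are hypothetical.

```python
from itertools import combinations
from math import prod


def order_preservation(expected: list[str], observed: list[str]) -> float:
    """1 minus the fraction of step pairs that appear in inverted order."""
    position = {step: i for i, step in enumerate(expected)}
    ranks = [position[s] for s in observed if s in position]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 1.0
    inversions = sum(1 for a, b in pairs if a > b)
    return 1.0 - inversions / len(pairs)


def ahp_weights(pairwise: list[list[float]]) -> list[float]:
    """Approximate AHP priority weights via normalized row geometric means."""
    n = len(pairwise)
    geo = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo)
    return [g / total for g in geo]


order_score = order_preservation(
    ["extract", "validate", "approve", "pay"],
    ["extract", "approve", "validate", "pay"],   # one adjacent swap
)

# Hypothetical judgments: criterion 1 is twice as important as criterion 2
# and four times as important as criterion 3.
weights = ahp_weights([
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
])

metric_scores = [order_score, 1.0, 0.9]          # hypothetical per-metric scores
final_score = sum(w * s for w, s in zip(weights, metric_scores))
print(round(order_score, 3), [round(w, 3) for w in weights], round(final_score, 3))
```

A single swapped pair out of six possible pairs gives an order score of about 0.833, and the consistent comparison matrix yields weights in a 4:2:1 ratio.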

Datasets

  • Finance invoice workflow (good-quality images)
  • Finance invoice workflow (degraded/blurred images)

Context Entities

Models

  • GPT-4o (used for Plan Generator and Plan Evaluator)

Metrics

  • Mean absolute error to human per step
  • Score dispersion statistics (IQR, whiskers)
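The two context metrics above have standard definitions, shown in the sketch below using only the standard library. All score lists are hypothetical illustrations and are not data from the paper.

```python
from statistics import quantiles


def mean_absolute_error(predicted: list[float], reference: list[float]) -> float:
    """Average absolute gap between judge scores and human scores per step."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def iqr(values: list[float]) -> float:
    """Interquartile range: spread of the middle half of repeated scores."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical per-step scores for a six-step workflow.
human = [1.0, 0.9, 1.0, 0.95, 1.0, 0.9]
judge = [0.9, 0.9, 0.9, 0.95, 0.9, 0.9]
mae = mean_absolute_error(judge, human)

# Hypothetical repeated scores for one step from two different evaluators.
stable_runs = [0.94, 0.95, 0.95, 0.96, 0.96, 0.97]
noisy_runs = [0.70, 0.80, 0.85, 0.90, 0.95, 1.00]

print(f"MAE={mae:.3f}, IQR stable={iqr(stable_runs):.3f}, IQR noisy={iqr(noisy_runs):.3f}")
```

MAE captures how far an evaluator sits from the human reference on average, while the IQR captures how much its own scores move between repeated runs. An evaluator can do well on one and poorly on the other, which is why both are reported.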

Datasets

  • Internal evaluation records / exemplars (retrieved via ChromaDB)

Benchmarks

  • Human-as-a-judge reference scores (used as ground truth in experiments)