A modular agent-based judge that checks step-by-step agent reasoning to better match human task-success labels

August 7, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The design and experiments show consistent alignment gains on two public text/code benchmarks, but scope is limited to text logs and single-file evidence, so production readiness is moderate.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 50%

Authors

Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan

Links

Abstract / PDF / Data

Why It Matters For Business

A log-aware, checklist-based judge can reduce human review by producing verdicts that better match human labels, especially for multi-step code tasks, lowering evaluation cost and speeding agent deployment decisions.

Summary (TL;DR)

The paper introduces a modular Agent-as-a-Judge system that evaluates agent task completion by generating checklist-style criteria, extracting evidence from agent logs, and verifying each step with specialized handlers. Tested on GAIA and BigCodeBench, the Judge (v3) aligns better with human verdicts than a GPT-4o LLM-as-a-Judge baseline (≈+4.8 percentage points on GAIA, ≈+10.5 on BigCodeBench). Limitations: text-only tasks, single-log input, and sensitivity to misleading or opinionated log content.

Problem Statement

Human evaluation of agent task completion is costly and slow. Existing LLM-as-a-Judge methods check only final outputs and miss intermediate reasoning. The paper asks how to build a general, domain-agnostic judge that assesses step-by-step agent behavior and improves alignment with human judgments.

Main Contribution

A domain-agnostic, modular Judge framework that evaluates agent task completion step-by-step using checklist questions tied to log evidence.

Design and implementation of four main modules: Criteria Generator, Artifact Content Parser, Criteria Check Composer (C3), and Verdict Generator.
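To make the four-module flow concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the `Criterion` dataclass, the `;`-separated task format, and the keyword matching that stands in for the LLM calls and specialized handlers the summary describes. It shows only the data flow (criteria, then evidence, then per-criterion checks, then verdict), not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    question: str        # checklist question derived from the task
    keywords: list[str]  # hypothetical stand-in for LLM-based evidence matching


def generate_criteria(task: str) -> list[Criterion]:
    """Criteria Generator (stub): one checklist item per task requirement.

    Assumption: requirements are separated by ';'.
    """
    criteria = []
    for req in (r.strip() for r in task.split(";")):
        if req:
            criteria.append(Criterion(f"Did the agent {req}?", req.lower().split()))
    return criteria


def parse_artifact(log_text: str) -> list[str]:
    """Artifact Content Parser (stub): split a single log into non-empty lines."""
    return [line.strip() for line in log_text.splitlines() if line.strip()]


def check_criterion(criterion: Criterion, evidence: list[str]) -> bool:
    """Criteria Check Composer (stub): pass if one log line contains every keyword."""
    return any(all(k in line.lower() for k in criterion.keywords) for line in evidence)


def generate_verdict(results: list[bool]) -> str:
    """Verdict Generator (stub): success only if every checklist item passed."""
    return "pass" if results and all(results) else "fail"


def judge(task: str, log_text: str) -> str:
    criteria = generate_criteria(task)
    evidence = parse_artifact(log_text)
    return generate_verdict([check_criterion(c, evidence) for c in criteria])


# Invented example log, not taken from GAIA or BigCodeBench.
log = """step 1: download the csv from the archive
step 2: compute the mean of column A"""
print(judge("download the csv; compute the mean", log))    # pass
print(judge("download the csv; plot the histogram", log))  # fail
```

The second call fails because no log line supports the "plot the histogram" requirement, which is the kind of step-level miss a final-output-only check would not surface.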

Key Findings

Judge v3 improves agreement with human labels on GAIA versus the GPT-4o LLM-as-a-Judge baseline.

Numbers: Accuracy 61.90% vs 57.14% (+4.76 percentage points)

Practical Use: Use step-wise, log-aware judging to get modest but consistent gains in matching human task-success labels on text tasks.

Evidence Ref: Table 1

Judge v3 shows a larger alignment gain on code tasks in BigCodeBench.

Numbers: Accuracy 73.68% vs 63.16% (+10.52 percentage points)

Practical Use: For code-heavy, multi-step tasks, a modular judge that inspects intermediate steps can substantially improve human alignment.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 61.90% (Our-Judge v3) | 57.14% (LLM-as-a-Judge) | +4.76% | GAIA (21 pass / 21 fail) | Table 1, LLM vs Our-Judge | Table 1 |
| Accuracy | 73.68% (Our-Judge v3) | 63.16% (LLM-as-a-Judge) | +10.52% | BigCodeBench (28 pass / 10 fail) | Table 1, LLM vs Our-Judge | Table 1 |
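Because the splits are small, each percentage maps to a whole number of samples. The short check below recovers those counts from the reported accuracies, assuming each accuracy is computed over the full split (42 GAIA samples, 38 BigCodeBench samples). The counts are derived arithmetic, not figures quoted from the paper.

```python
# Split sizes come from the pass + fail counts in the results table.
splits = {"GAIA": 21 + 21, "BigCodeBench": 28 + 10}

# Reported accuracies in percent: (Our-Judge v3, LLM-as-a-Judge baseline).
reported = {"GAIA": (61.90, 57.14), "BigCodeBench": (73.68, 63.16)}

implied = {}
for name, n in splits.items():
    judge_acc, base_acc = reported[name]
    # Round to the nearest whole sample; assumes accuracy = correct / n.
    judge_correct = round(judge_acc / 100 * n)
    base_correct = round(base_acc / 100 * n)
    implied[name] = (judge_correct, base_correct, n)
    print(f"{name}: judge {judge_correct}/{n}, baseline {base_correct}/{n}, "
          f"difference {judge_correct - base_correct} samples")
```

Under that assumption the GAIA delta corresponds to 2 of 42 samples and the BigCodeBench delta to 4 of 38, which is worth keeping in mind when weighing the gains.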

What To Try In 7 Days

Pipe agent run logs into a simple checklist generator to verify explicit task requirements.

Index and summarize long agent logs in 300-token chunks for targeted retrieval.

Compare a log-aware judge's verdicts against a final-output-only LLM baseline on a small task sample to measure alignment gains.
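The comparison in the last item can be scored with a small confusion-matrix helper. This is a sketch under stated assumptions: verdicts and human labels are booleans (True means the task succeeded), the function name `alignment_metrics` is hypothetical, and the sample labels are invented purely to exercise the code.

```python
def alignment_metrics(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Score judge verdicts against human labels (True = task succeeded)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    fp = sum(not h and j for h, j in zip(human, judge))
    fn = sum(h and not j for h, j in zip(human, judge))

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(human)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Invented toy labels, not data from the paper.
human_labels = [True, True, False, False, True, False]
log_aware = [True, True, False, True, True, False]
output_only = [True, False, True, True, True, False]

for name, verdicts in [("log-aware", log_aware), ("output-only", output_only)]:
    print(name, alignment_metrics(human_labels, verdicts))
```

Reporting specificity alongside accuracy helps on imbalanced samples, where a judge that passes everything can still score a high accuracy.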

Agent Features

Memory
Retrieval memory via chunked indices
Planning
Decision-tree verification plans
Task-conditioned verification trajectories
Tool Use
Web surf / retrieval
Code execution environment
LoRA
Frameworks
Magentic-One
RAG-inspired indexer/retriever
Is Agentic

Yes

Architectures
Modular multi-agent verification
Planner-orchestrator-worker pattern
Collaboration
Multi-agent orchestration (planner + workers)

Optimization Features

Token Efficiency
Chunking logs into 300-token summaries to reduce context usage
System Optimization
LLM-based filtering to remove redundant checklist items
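The 300-token chunking listed under Token Efficiency can be sketched in a few lines. The assumptions here are mine rather than the paper's: tokens are approximated by whitespace-separated words, retrieval is plain word overlap, and `chunk_log` and `retrieve` are hypothetical names. The source describes summarizing each chunk; this sketch keeps the raw text so it stays self-contained.

```python
def chunk_log(log_text: str, chunk_size: int = 300) -> list[str]:
    """Split a log into chunks of at most `chunk_size` whitespace tokens."""
    tokens = log_text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]


def retrieve(chunks: list[str], query: str, top_k: int = 1) -> list[str]:
    """Return the `top_k` chunks sharing the most distinct words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


# Synthetic log: 650 tokens of filler with one relevant phrase in the middle.
log = " ".join(["noop"] * 320 + ["unit", "tests", "passed"] + ["noop"] * 327)
chunks = chunk_log(log)
print(len(chunks), [len(c.split()) for c in chunks])

best = retrieve(chunks, "did the unit tests pass")[0]
print("unit tests passed" in best)
```

Only the retrieved chunk needs to be sent to the judge for a given checklist question, which is where the context savings come from.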

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

GAIA (public)
BigCodeBench (public)

Risks & Boundaries

Limitations

Supports only text-based tasks; multimodal and file/attachment checks are not handled.

Artifact Content Parser accepts a single log file; multiple outputs or artifacts are unsupported.

When Not To Use

Tasks that include images, audio, or other non-textual artifacts.

Workflows producing multiple disparate artifacts or separate logs.

Failure Modes

Over-reliance on actor logs, leading to false positives when the logs claim actions that the agent did not actually perform.

Confusing fictional or role-play instructions with real actions, yielding irrelevant checklist items.

Core Entities

Models

GPT-4o, Magentic-One, Qwen 2.5, Llama 3.1, Llama 3.2

Metrics

Accuracy, Precision, Recall, Specificity, Human alignment

Datasets

GAIA, BigCodeBench

Benchmarks

GAIA, BigCodeBench

Context Entities

Models

Compass-Judger-1, Prometheus, AutoArena, ChatEval

Metrics

Human alignment (confusion matrix)

Datasets

MT-Bench (related), Chatbot Arena (related)

Benchmarks

MT-Bench