A modular agent-based judge that checks step-by-step agent reasoning to better match human task-success labels

Overview

Decision Snapshot: Needs Validation

The design and experiments show consistent alignment gains on two public text/code benchmarks, but scope is limited to text logs and single-file evidence, so production readiness is moderate.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 50%

Authors

Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan

Links

Abstract / PDF / Data

Why It Matters For Business

A log-aware, checklist-based judge can reduce human review by producing verdicts that better match human labels, especially for multi-step code tasks, lowering evaluation cost and speeding agent deployment decisions.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces a modular Agent-as-a-Judge system that evaluates agent task completion by generating checklist-style criteria, extracting evidence from agent logs, and verifying each step with specialized handlers. Tested on GAIA and BigCodeBench, Judge v3 agrees with human verdicts more often than a GPT-4o LLM-as-a-Judge baseline: accuracy rises by about 4.8 percentage points on GAIA and about 10.5 on BigCodeBench. Limitations: text-only tasks, single-log input, and sensitivity to misleading or opinionated log content.

Problem Statement

Human evaluation of agent task completion is costly and slow. Existing LLM-as-a-Judge methods check only final outputs and miss intermediate reasoning. The paper asks how to build a general, domain-agnostic judge that assesses step-by-step agent behavior and improves alignment with human judgments.

Main Contribution

A domain-agnostic, modular Judge framework that evaluates agent task completion step-by-step using checklist questions tied to log evidence.

Design and implementation of four main modules: Criteria Generator, Artifact Content Parser, Criteria Check Composer (C3), and Verdict Generator.
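
The four modules can be read as a pipeline: criteria are generated from the task, evidence for each criterion is pulled from the log, each criterion is checked against its evidence, and the checks are combined into a verdict. The following is a minimal sketch of that flow, not the authors' implementation. The function names, the `CheckResult` type, the keyword-matching stand-ins, and the "every check must pass" aggregation rule are all assumptions made for illustration; the real modules are LLM-driven.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    question: str   # checklist item derived from the task
    evidence: str   # log excerpt used to answer it
    passed: bool    # outcome of checking the item against the evidence


def judge(
    task: str,
    log: str,
    generate_criteria: Callable[[str], List[str]],
    parse_artifact: Callable[[str, str], str],
    check_criterion: Callable[[str, str], bool],
) -> bool:
    """Run the four stages in order and return a pass/fail verdict."""
    results: List[CheckResult] = []
    for question in generate_criteria(task):          # Criteria Generator
        evidence = parse_artifact(log, question)      # Artifact Content Parser
        passed = check_criterion(question, evidence)  # Criteria Check Composer (C3)
        results.append(CheckResult(question, evidence, passed))
    # Verdict Generator: assumed here to require that every check passes.
    return bool(results) and all(r.passed for r in results)


# Toy stand-ins so the sketch runs without an LLM (hypothetical, illustration only).
def toy_generate(task: str) -> List[str]:
    return ["output written", "tests passed"]


def toy_parse(log: str, question: str) -> str:
    # First log line containing the criterion phrase, or "" if none does.
    return next((line for line in log.splitlines() if question in line), "")


def toy_check(question: str, evidence: str) -> bool:
    return evidence != ""


full_log = "step 1: output written to result.txt\nstep 2: tests passed"
partial_log = "step 1: output written to result.txt"
print(judge("write and test", full_log, toy_generate, toy_parse, toy_check))
print(judge("write and test", partial_log, toy_generate, toy_parse, toy_check))
```

Passing the stages in as callables keeps the orchestration separate from the model calls, so any one stage can be swapped or tested in isolation.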

Key Findings

Judge v3 improves agreement with human labels on GAIA versus the GPT-4o baseline.

Numbers: Accuracy 61.90% vs 57.14% (+4.76 percentage points)

Practical Use: Use step-wise, log-aware judging to get modest but consistent gains in matching human task-success labels on text tasks.

Evidence Ref: Table 1

Judge v3 shows a larger alignment gain on code tasks in BigCodeBench.

Numbers: Accuracy 73.68% vs 63.16% (+10.52 percentage points)

Practical Use: For code-heavy, multi-step tasks, a modular judge that inspects intermediate steps can substantially improve human alignment.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	61.90% (Our-Judge v3)	57.14% (LLM-as-a-Judge)	+4.76%	GAIA (21 pass / 21 fail)	Table 1 LLM vs Our-Judge	Table 1
Accuracy	73.68% (Our-Judge v3)	63.16% (LLM-as-a-Judge)	+10.52%	BigCodeBench (28 pass / 10 fail)	Table 1 LLM vs Our-Judge	Table 1
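
The percentages in the table can be sanity-checked against the sample sizes in the Split / Dataset column. In the sketch below, the totals (42 and 38) come from that column, while the correct-verdict counts are inferred from the reported percentages and are not stated in the source.

```python
# Totals come from the Split / Dataset column; the *_correct counts are
# inferred from the reported percentages (an assumption, not source data).
rows = {
    "GAIA": {"total": 21 + 21, "judge_correct": 26, "baseline_correct": 24},
    "BigCodeBench": {"total": 28 + 10, "judge_correct": 28, "baseline_correct": 24},
}


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of all judged tasks."""
    return 100.0 * correct / total


def summarize(row: dict) -> tuple:
    """Return (judge accuracy, baseline accuracy, delta in percentage points)."""
    judge = round(accuracy_pct(row["judge_correct"], row["total"]), 2)
    baseline = round(accuracy_pct(row["baseline_correct"], row["total"]), 2)
    return judge, baseline, round(judge - baseline, 2)


for name, row in rows.items():
    print(name, summarize(row))
```

With samples this small, the deltas correspond to a net two additional correct verdicts on GAIA and four on BigCodeBench, which is worth keeping in mind when weighing the gains.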

What To Try In 7 Days

Pipe agent run logs into a simple checklist generator to verify explicit task requirements.

Index and summarize long agent logs in 300-token chunks for targeted retrieval.

Compare a log-aware judge's verdicts against a final-output-only LLM baseline on a small task sample to measure alignment gains.
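
The comparison step can be set up with very little code once each judge's verdicts and the human labels are available as parallel lists. The sketch below is one hypothetical way to do it: the function names and the five example labels are made up for illustration and are not data from the paper.

```python
from typing import Dict, List


def agreement(verdicts: List[bool], human: List[bool]) -> float:
    """Fraction of tasks where a judge's verdict equals the human label."""
    if len(verdicts) != len(human) or not human:
        raise ValueError("verdicts and labels must be non-empty and of equal length")
    return sum(v == h for v, h in zip(verdicts, human)) / len(human)


def compare(log_aware: List[bool], output_only: List[bool], human: List[bool]) -> Dict[str, object]:
    a = agreement(log_aware, human)
    b = agreement(output_only, human)
    # Tasks where exactly one of the two judges matches the human label;
    # these are the most informative ones to inspect by hand.
    review = [
        i for i, (x, y, h) in enumerate(zip(log_aware, output_only, human))
        if (x == h) != (y == h)
    ]
    return {"log_aware": a, "output_only": b, "delta": a - b, "review": review}


# Made-up labels for five tasks, for illustration only.
human = [True, False, True, True, False]
log_aware = [True, False, True, False, False]
output_only = [True, True, True, False, True]
result = compare(log_aware, output_only, human)
print(result)
```

The `review` list narrows manual inspection to the tasks where the two judges part ways, which keeps human effort low on a first pass.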

Agent Features

Memory

Retrieval memory via chunked indices

Planning

Decision-tree verification plans, Task-conditioned verification trajectories

Tool Use

Web surf / retrieval, Code execution environment, LoRA

Frameworks

Magentic-One, RAG-inspired indexer/retriever

Is Agentic

Yes

Architectures

Modular multi-agent verification, Planner-orchestrator-worker pattern

Collaboration

Multi-agent orchestration (planner + workers)

Optimization Features

Token Efficiency

Chunking logs into 300-token summaries to reduce context usage
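
A rough sketch of the chunk-then-retrieve idea follows. It makes several simplifying assumptions: whitespace-separated words stand in for model tokens, the summarization step is omitted, and retrieval is plain keyword overlap rather than whatever index the authors use. `chunk_log` and `retrieve` are hypothetical names.

```python
from typing import List


def chunk_log(log: str, max_tokens: int = 300) -> List[str]:
    """Split a log into consecutive chunks of at most max_tokens tokens."""
    tokens = log.split()  # whitespace words stand in for a real tokenizer
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def retrieve(chunks: List[str], query: str, top_k: int = 1) -> List[str]:
    """Rank chunks by how many distinct query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


# A synthetic 653-word log whose last three words hold the answer.
log = " ".join(f"step{i}" for i in range(650)) + " final answer 42"
chunks = chunk_log(log)
print(len(chunks), [len(c.split()) for c in chunks])
print(retrieve(chunks, "final answer")[0][-15:])
```

Only the retrieved chunk needs to enter the judge's context, which is where the token saving comes from.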

System Optimization

LLM-based filtering to remove redundant checklist items
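
The source describes this filter as LLM-based. The sketch below swaps in a much simpler stand-in, character-level similarity from Python's standard `difflib`, purely to show where a redundancy filter sits in the flow. The function name and the 0.85 threshold are assumptions, and a string-similarity check will miss paraphrases that an LLM would catch.

```python
from difflib import SequenceMatcher
from typing import List


def drop_redundant(items: List[str], threshold: float = 0.85) -> List[str]:
    """Keep an item only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    for item in items:
        norm = item.lower().strip()
        is_new = all(
            SequenceMatcher(None, norm, k.lower().strip()).ratio() < threshold
            for k in kept
        )
        if is_new:
            kept.append(item)
    return kept


checklist = [
    "Did the agent run the unit tests?",
    "Did the agent run the unit tests ?",
    "Is the final answer a single integer?",
]
filtered = drop_redundant(checklist)
print(filtered)
```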

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

GAIA (public), BigCodeBench (public)

Risks & Boundaries

Limitations

Supports only text-based tasks; multimodal and file/attachment checks are not handled.

Artifact Content Parser accepts a single log file; multiple outputs or artifacts are unsupported.

When Not To Use

Tasks that include images, audio, or other non-textual artifacts.

Workflows producing multiple disparate artifacts or separate logs.

Failure Modes

Over-reliance on actor logs, leading to false positives when a log claims an action that was never actually performed.

Confusing fictional or role-play instructions with real actions, yielding irrelevant checklist items.

Core Entities

Models

GPT-4o, Magentic-One, Qwen 2.5, Llama 3.1, Llama 3.2

Metrics

Accuracy, Precision, Recall, Specificity, Human alignment
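
For reference, the four confusion-matrix metrics can be computed from paired judge verdicts and human labels as sketched below. Treating "pass" as the positive class is an assumption, the function name is hypothetical, and the six example labels are made up for illustration.

```python
from typing import Dict, List


def alignment_metrics(judge: List[bool], human: List[bool]) -> Dict[str, float]:
    """Confusion-matrix metrics with the human label as ground truth."""
    pairs = list(zip(judge, human))
    tp = sum(j and h for j, h in pairs)              # judge pass, human pass
    tn = sum((not j) and (not h) for j, h in pairs)  # judge fail, human fail
    fp = sum(j and (not h) for j, h in pairs)        # judge pass, human fail
    fn = sum((not j) and h for j, h in pairs)        # judge fail, human pass

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up labels for six tasks, for illustration only.
human = [True, True, True, False, False, False]
judge = [True, True, False, True, False, False]
metrics = alignment_metrics(judge, human)
print(metrics)
```

Specificity matters here because a judge that passes everything scores well on recall while missing every failed run.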

Datasets

GAIA, BigCodeBench

Benchmarks

GAIA, BigCodeBench

Context Entities

Models

Compass-Judger-1, Prometheus, AutoArena, ChatEval

Metrics

Human alignment (confusion matrix)

Datasets

MT-Bench (related), Chatbot Arena (related)

Benchmarks

MT-Bench
