A modular agent-based judge that checks step-by-step agent reasoning to better match human task-success labels

August 7, 2025 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

0

Authors

Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare, Soundar Srinivasan

Links

Abstract / PDF

Why It Matters For Business

A log-aware, checklist-based judge can reduce human review by producing verdicts that better match human labels, especially for multi-step code tasks, lowering evaluation cost and speeding agent deployment decisions.

Summary TLDR

The paper introduces a modular Agent-as-a-Judge system that evaluates agent task completion by generating checklist-style criteria, extracting evidence from agent logs, and verifying each step with specialized handlers. Tested on GAIA and BigCodeBench, the Judge (v3) aligns better with human verdicts than a GPT-4o LLM-as-a-Judge baseline (accuracy gains of about 4.8 percentage points on GAIA and about 10.5 on BigCodeBench). Limitations: text-only tasks, single-log input, and sensitivity to misleading or opinionated log content.

Problem Statement

Human evaluation of agent task completion is costly and slow. Existing LLM-as-a-Judge methods check only final outputs and miss intermediate reasoning. The paper asks how to build a general, domain-agnostic judge that assesses step-by-step agent behavior and improves alignment with human judgments.

Main Contribution

A domain-agnostic, modular Judge framework that evaluates agent task completion step-by-step using checklist questions tied to log evidence.

Design and implementation of four main modules: Criteria Generator, Artifact Content Parser, Criteria Check Composer (C3), and Verdict Generator.

Empirical evaluation on GAIA and BigCodeBench showing improved alignment with human labels over a GPT-4o LLM-as-a-Judge baseline.
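
The four modules listed above can be read as a simple pipeline. The following is a minimal Python sketch of that flow, not the paper's implementation: every function name, the `Criterion` and `CheckResult` types, and the keyword-matching logic are illustrative assumptions that stand in for the LLM-driven components.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Criterion:
    question: str        # checklist question derived from the task
    keywords: List[str]  # hypothetical retrieval hints (not from the paper)


@dataclass
class CheckResult:
    criterion: Criterion
    evidence: List[str]  # log lines offered as proof
    passed: bool


def generate_criteria(task: str) -> List[Criterion]:
    """Criteria Generator stand-in: one criterion per ';'-separated requirement."""
    criteria = []
    for part in task.split(";"):
        part = part.strip()
        if part:
            words = [w.lower().strip(".,") for w in part.split() if len(w) > 3]
            criteria.append(Criterion(f"Was this done: {part}?", words))
    return criteria


def parse_artifacts(log: str, criterion: Criterion) -> List[str]:
    """Artifact Content Parser stand-in: keep log lines mentioning any keyword."""
    return [
        line for line in log.splitlines()
        if any(k in line.lower() for k in criterion.keywords)
    ]


def check_criterion(criterion: Criterion, evidence: List[str]) -> CheckResult:
    """Criteria Check Composer (C3) stand-in: pass when some evidence exists."""
    return CheckResult(criterion, evidence, passed=bool(evidence))


def generate_verdict(results: List[CheckResult]) -> bool:
    """Verdict Generator stand-in: success only if every criterion passed."""
    return bool(results) and all(r.passed for r in results)


def judge(task: str, log: str) -> Tuple[bool, List[CheckResult]]:
    """Run the four stages in order on a single text log."""
    results = []
    for criterion in generate_criteria(task):
        evidence = parse_artifacts(log, criterion)
        results.append(check_criterion(criterion, evidence))
    return generate_verdict(results), results
```

Calling `judge("write the parser; run the tests", log)` returns an overall verdict plus the per-criterion evidence. This toy version accepts whatever the log says at face value, which is the same over-trust risk the summary lists under limitations; a real handler would need to verify the evidence rather than match keywords.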

Key Findings

Judge v3 improves agreement with human labels on GAIA versus the GPT-4o baseline.

Numbers: Accuracy 61.90% vs 57.14% (+4.76 percentage points)

Judge v3 shows a larger alignment gain on code tasks in BigCodeBench.

Numbers: Accuracy 73.68% vs 63.16% (+10.52 percentage points)

Judge v3 achieves much higher precision on BigCodeBench than the baseline.

Numbers: Precision 92.31% vs 76.47% (+15.84 percentage points)

Results

Accuracy (GAIA)

Value: 61.90% (Our-Judge v3)

Baseline: 57.14% (LLM-as-a-Judge)

Accuracy (BigCodeBench)

Value: 73.68% (Our-Judge v3)

Baseline: 63.16% (LLM-as-a-Judge)

Precision (BigCodeBench)

Value: 92.31% (Our-Judge v3)

Baseline: 76.47% (LLM-as-a-Judge)

Recall (BigCodeBench)

Value: 92.30% (Our-Judge v3)

Baseline: 100.00% (LLM-as-a-Judge)
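
For readers who want to reproduce this kind of comparison on their own data, the sketch below computes accuracy, precision, recall, and specificity from paired human and judge verdicts. The function name `alignment_metrics` and the boolean encoding (True means "task succeeded") are assumptions for illustration; the paper's evaluation code is not shown in this summary.

```python
from typing import Dict, Sequence


def alignment_metrics(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compare judge verdicts with human labels (True = task succeeded)."""
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both say success
    tn = sum(1 for h, j in pairs if not h and not j)  # both say failure
    fp = sum(1 for h, j in pairs if not h and j)      # judge too lenient
    fn = sum(1 for h, j in pairs if h and not j)      # judge too strict

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

A judge that always answers "success" reaches a recall of 1.0 and a specificity of 0.0 under these definitions, which is why the baseline's 100.00% recall on BigCodeBench is best read together with its lower precision.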

Who Should Care

What To Try In 7 Days

Pipe agent run logs into a simple checklist generator to verify explicit task requirements.

Index and summarize long agent logs in 300-token chunks for targeted retrieval.

Compare a log-aware judge's verdicts against a final-output-only LLM baseline on a small task sample to measure alignment gains.
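
As a starting point for the chunk-and-retrieve step, here is a rough Python sketch. It makes two loud assumptions: whitespace-separated words stand in for model tokens, and plain term overlap stands in for whatever retriever the paper uses. The summarization step is omitted, and `chunk_log` and `retrieve` are hypothetical helper names.

```python
from typing import List, Tuple


def chunk_log(log: str, chunk_size: int = 300) -> List[str]:
    """Split a log into consecutive chunks of at most chunk_size tokens.

    Tokens are whitespace-separated words here, an approximation of
    real tokenizer output.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = log.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def retrieve(chunks: List[str], query: str, top_k: int = 2) -> List[Tuple[int, str]]:
    """Return up to top_k (index, chunk) pairs ranked by shared terms with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for idx, chunk in enumerate(chunks):
        overlap = len(query_terms & set(chunk.lower().split()))
        if overlap:
            scored.append((overlap, idx))
    # Highest overlap first; earlier chunks win ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, chunks[idx]) for _, idx in scored[:top_k]]
```

Each checklist question can then be passed as `query`, so the verifier only sees the few chunks likely to contain relevant evidence instead of the full log.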

Agent Features

Memory

  • Retrieval memory via chunked indices

Planning

  • Decision-tree verification plans
  • Task-conditioned verification trajectories

Tool Use

  • Web surf / retrieval
  • Code execution environment
  • LoRA

Frameworks

  • Magentic-One
  • RAG-inspired indexer/retriever

Is Agentic

true

Architectures

  • Modular multi-agent verification
  • Planner-orchestrator-worker pattern

Collaboration

  • Multi-agent orchestration (planner + workers)

Optimization Features

Token Efficiency

  • Chunking logs into 300-token summaries to reduce context usage

System Optimization

  • LLM-based filtering to remove redundant checklist items
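
The redundancy filter can be sketched as a pass that keeps a checklist item only if it does not duplicate one already kept. The paper uses an LLM for this decision; the sketch below swaps in a simple Jaccard word-overlap test as a placeholder, and the names `filter_checklist` and `jaccard_redundant` and the 0.8 threshold are assumptions, not values from the paper.

```python
from typing import Callable, List


def jaccard_redundant(a: str, b: str, threshold: float = 0.8) -> bool:
    """Placeholder for the LLM check: treat items with high word overlap as redundant."""
    terms_a, terms_b = set(a.lower().split()), set(b.lower().split())
    if not terms_a or not terms_b:
        return False
    return len(terms_a & terms_b) / len(terms_a | terms_b) >= threshold


def filter_checklist(
    items: List[str],
    is_redundant: Callable[[str, str], bool] = jaccard_redundant,
) -> List[str]:
    """Keep the first occurrence of each checklist item, dropping redundant ones."""
    kept: List[str] = []
    for item in items:
        if not any(is_redundant(item, existing) for existing in kept):
            kept.append(item)
    return kept
```

Because `is_redundant` is injectable, the overlap test can be replaced by a function that asks a model whether two questions check the same requirement, without changing the filtering loop.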

Reproducibility

Data Urls

  • GAIA (public)
  • BigCodeBench (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Supports only text-based tasks; multimodal and file/attachment checks are not handled.
  • Artifact Content Parser accepts a single log file; multiple outputs or artifacts are unsupported.
  • Judge can over-trust or misread actor logs and may accept actor-provided proofs as ground truth.
  • Content parser can inject opinions or extract proofs from actor plans, causing incorrect verdicts.

When Not To Use

  • Tasks that include images, audio, or other non-textual artifacts.
  • Workflows producing multiple disparate artifacts or separate logs.
  • Scenarios where the judge must independently solve the task rather than verify the actor.

Failure Modes

  • Over-reliance on actor logs, leading to false positives when the logs claim actions that were never actually performed.
  • Confusing fictional or role-play instructions with real actions, yielding irrelevant checklist items.
  • Parser output injects subjective language that can mislead verification modules.
  • Overly conservative verification that rejects some genuinely successful runs (false negatives).

Core Entities

Models

  • GPT-4o
  • Magentic-One
  • Qwen 2.5
  • Llama 3.1
  • Llama 3.2

Metrics

  • Accuracy
  • Precision
  • Recall
  • Specificity
  • Human alignment

Datasets

  • GAIA
  • BigCodeBench

Benchmarks

  • GAIA
  • BigCodeBench

Context Entities

Models

  • Compass-Judger-1
  • Prometheus
  • AutoArena
  • ChatEval

Metrics

  • Human alignment (confusion matrix)

Datasets

  • MT-Bench (related)
  • Chatbot Arena (related)

Benchmarks

  • MT-Bench