Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
A log-aware, checklist-based judge can reduce human review by producing verdicts that better match human labels, especially for multi-step code tasks, lowering evaluation cost and speeding agent deployment decisions.
Summary TLDR
The paper introduces a modular Agent-as-a-Judge system that evaluates agent task completion by generating checklist-style criteria, extracting evidence from agent logs, and verifying each step with specialized handlers. Tested on GAIA and BigCodeBench, the Judge (v3) aligns better with human verdicts than a GPT-4o LLM-as-a-Judge baseline (≈+4.8% on GAIA, ≈+10.5% on BigCodeBench). Limitations: text-only tasks, single-log input, and sensitivity to misleading or opinionated log content.
Problem Statement
Human evaluation of agent task completion is costly and slow. Existing LLM-as-a-Judge methods check only final outputs and miss intermediate reasoning. The paper asks how to build a general, domain-agnostic judge that assesses step-by-step agent behavior and improves alignment with human judgments.
Main Contribution
A domain-agnostic, modular Judge framework that evaluates agent task completion step-by-step using checklist questions tied to log evidence.
Design and implementation of four main modules: Criteria Generator, Artifact Content Parser, Criteria Check Composer (C3), and Verdict Generator.
Empirical evaluation on GAIA and BigCodeBench showing improved alignment with human labels over a GPT-4o LLM-as-a-Judge baseline.
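A minimal sketch of how the four modules might chain together, not the paper's implementation. Every name here (Criterion, generate_criteria, parse_evidence, check_criterion, judge) is hypothetical, and simple keyword matching stands in for the LLM calls that each module would make in a real system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Criterion:
    question: str  # checklist question shown to the verifier
    keyword: str   # hypothetical: term the parser looks for in the log


@dataclass
class CheckResult:
    criterion: Criterion
    evidence: str  # log line used as proof ("" when nothing was found)
    passed: bool


def generate_criteria(requirements: List[str]) -> List[Criterion]:
    """Stand-in for the Criteria Generator: one checklist item per requirement."""
    return [
        Criterion(f"Did the agent complete: {req}?", req.split()[0].lower())
        for req in requirements
        if req.strip()
    ]


def parse_evidence(criterion: Criterion, log_lines: List[str]) -> str:
    """Stand-in for the Artifact Content Parser: first log line mentioning the keyword."""
    for line in log_lines:
        if criterion.keyword in line.lower():
            return line
    return ""


def check_criterion(criterion: Criterion, evidence: str) -> CheckResult:
    """Stand-in for one Criteria Check Composer (C3) handler.
    Assumption: evidence must exist and must not report an error."""
    passed = bool(evidence) and "error" not in evidence.lower()
    return CheckResult(criterion, evidence, passed)


def judge(requirements: List[str], log_lines: List[str]) -> Tuple[bool, List[CheckResult]]:
    """Stand-in for the Verdict Generator: the task passes only if every check passes."""
    results = [
        check_criterion(c, parse_evidence(c, log_lines))
        for c in generate_criteria(requirements)
    ]
    return all(r.passed for r in results), results
```

Keeping per-criterion results alongside the final verdict makes it possible to see which checklist item caused a rejection, which is the main practical difference from a judge that only inspects the final output.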
Key Findings
Judge v3 improves agreement with human labels on GAIA over the GPT-4o baseline.
Judge v3 shows a larger alignment gain on the code tasks in BigCodeBench.
Judge v3 achieves much higher precision on BigCodeBench than the baseline.
Results
Accuracy
Accuracy
Precision (BigCodeBench)
Recall (BigCodeBench)
Who Should Care
What To Try In 7 Days
Pipe agent run logs into a simple checklist generator to verify explicit task requirements.
Index and summarize long agent logs in 300-token chunks for targeted retrieval.
Compare a log-aware judge's verdicts against a final-output-only LLM baseline on a small task sample to measure alignment gains.
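One way to run the comparison in the last step is to score each judge against the same human labels. The sketch below is illustrative only: alignment_metrics is a hypothetical helper, verdicts are assumed to be booleans (True means "task completed"), and human labels are treated as ground truth.

```python
from typing import Dict, Sequence


def alignment_metrics(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Score judge verdicts against human labels via a 2x2 confusion matrix."""
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both say "completed"
    tn = sum(1 for h, j in pairs if not h and not j)  # both say "failed"
    fp = sum(1 for h, j in pairs if not h and j)      # judge too lenient
    fn = sum(1 for h, j in pairs if h and not j)      # judge too strict

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

Calling the function once with the log-aware judge's verdicts and once with the baseline's, on the same sample, gives directly comparable numbers; the difference in accuracy is the alignment gain.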
Agent Features
Memory
- Retrieval memory via chunked indices
Planning
- Decision-tree verification plans
- Task-conditioned verification trajectories
Tool Use
- Web surf / retrieval
- Code execution environment
- LoRA
Frameworks
- Magentic-One
- RAG-inspired indexer/retriever
Is Agentic
true
Architectures
- Modular multi-agent verification
- Planner-orchestrator-worker pattern
Collaboration
- Multi-agent orchestration (planner + workers)
Optimization Features
Token Efficiency
- Chunking logs into 300-token summaries to reduce context usage
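A rough sketch of the chunk-and-retrieve idea, under loud assumptions: whitespace-separated words stand in for model tokens, a word-overlap score stands in for the retriever, and the summarization step (which would need an LLM call) is omitted. The function names are hypothetical.

```python
from typing import Dict, List, Set


def chunk_log(log_text: str, chunk_tokens: int = 300) -> List[str]:
    """Split a log into consecutive chunks of at most chunk_tokens words."""
    tokens = log_text.split()
    return [
        " ".join(tokens[i:i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]


def build_index(chunks: List[str]) -> Dict[str, Set[int]]:
    """Inverted index: lowercased word -> ids of the chunks containing it."""
    index: Dict[str, Set[int]] = {}
    for chunk_id, chunk in enumerate(chunks):
        for word in set(chunk.lower().split()):
            index.setdefault(word, set()).add(chunk_id)
    return index


def retrieve(query: str, chunks: List[str], index: Dict[str, Set[int]], top_k: int = 3) -> List[str]:
    """Return up to top_k chunks ranked by how many query words they contain."""
    scores: Dict[int, int] = {}
    for word in set(query.lower().split()):
        for chunk_id in index.get(word, ()):
            scores[chunk_id] = scores.get(chunk_id, 0) + 1
    # Highest score first; ties broken by position in the log.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [chunks[cid] for cid in ranked[:top_k]]
```

Only the retrieved chunks need to be placed in the verifier's context, which is where the token saving comes from.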
System Optimization
- LLM-based filtering to remove redundant checklist items
Reproducibility
Data Urls
- GAIA (public)
- BigCodeBench (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Supports only text-based tasks; multimodal and file/attachment checks are not handled.
- Artifact Content Parser accepts a single log file; multiple outputs or artifacts are unsupported.
- Judge can over-trust or misread actor logs and may accept actor-provided proofs as ground truth.
- Content parser can inject opinions or extract proofs from actor plans, causing incorrect verdicts.
When Not To Use
- Tasks that include images, audio, or other non-textual artifacts.
- Workflows producing multiple disparate artifacts or separate logs.
- Scenarios where the judge must independently solve the task rather than verify the actor.
Failure Modes
- Over-reliance on actor logs, leading to false positives when logs claim actions that were never performed.
- Confusing fictional or role-play instructions with real actions, yielding irrelevant checklist items.
- Parser output injects subjective language that can mislead verification modules.
- Conservative verification that rejects some runs humans label as successful (false negatives).
Core Entities
Models
- GPT-4o
- Magentic-One
- Qwen 2.5
- Llama 3.1
- Llama 3.2
Metrics
- Accuracy
- Precision
- Recall
- Specificity
- Human alignment
Datasets
- GAIA
- BigCodeBench
Benchmarks
- GAIA
- BigCodeBench
Context Entities
Models
- Compass-Judger-1
- Prometheus
- AutoArena
- ChatEval
Metrics
- Human alignment (confusion matrix)
Datasets
- MT-Bench (related)
- Chatbot Arena (related)
Benchmarks
- MT-Bench

