AUTO-J: a 13B open-source judge that scores LLM outputs across 58 real-world scenarios and writes critiques

Overview

Decision Snapshot: Ready For Pilot

AUTO-J is ready for evaluation pipelines and internal selection tasks. Expect a remaining gap to GPT-4 at the sample level, and plan a human check for high-stakes outputs.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu

Why It Matters For Business

AUTO-J offers a reusable, lower-cost, and reproducible judgment engine for internal model comparisons and automated selection, reducing dependence on expensive closed APIs while giving readable critiques teams can act on.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO

Summary TLDR

AUTO-J is a 13B-parameter open-source evaluator fine-tuned to judge large language model outputs. Trained on model responses and GPT-4 judgments across 58 real-world scenarios, it supports pairwise comparison, single-response critique, and numeric ratings. On the authors' meta-tests it matches or beats many open-source baselines and narrows the gap to proprietary judges on system-level ranking. Code, data, and prompts are released.

Problem Statement

Off-the-shelf automatic metrics and ad-hoc human labels don't scale to diverse, real-world alignment evaluation. Teams need an evaluator that (1) works across many user scenarios without gold references, (2) supports multiple evaluation protocols (pairwise, single-response, scalar rating), and (3) returns readable critiques so humans can inspect and act on judgments.

Main Contribution

AUTO-J: a 13B generative judge trained to produce pairwise decisions, single-response critiques, and scalar ratings with human-readable explanations

A new judgment dataset built from 58 real-world scenarios with 332 curated scenario criteria and mixed GPT-4 / human labels

Key Findings

AUTO-J achieves state-of-the-art agreement among open-source judges on a 58-scenario pairwise benchmark.

Numbers: 8.9% relative improvement (pairwise vs open-source baselines)

Practical Use: Use AUTO-J as a drop-in open-source judge when you need broad-scenario pairwise comparisons instead of calling closed APIs.

Evidence Ref: Intro / §6.1
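
In its simplest form, a pairwise agreement figure is the fraction of comparisons where the judge's verdict matches the reference label. A minimal sketch, assuming verdicts are encoded as the hypothetical strings "A", "B", and "tie" (the paper's exact label set and tie handling may differ):

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where judge and reference verdicts match."""
    if len(judge) != len(reference):
        raise ValueError("judge and reference must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    matches = sum(j == r for j, r in zip(judge, reference))
    return matches / len(judge)


# Toy labels for illustration only; these are not results from the paper.
judge_verdicts = ["A", "B", "tie", "A"]
human_verdicts = ["A", "B", "A", "A"]
print(agreement_rate(judge_verdicts, human_verdicts))  # 0.75
```

The same function can be pointed at your own human labels to check how closely a judge tracks them on a small slice of data.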

AUTO-J outperforms ChatGPT and Claude-2 on the authors' pairwise test by double-digit margins.

Numbers: +12.1% and +12.4% (pairwise agreement vs ChatGPT/Claude-2)

Practical Use: For internal A/B work where API cost or reproducibility is a concern, AUTO-J can replace some closed-model evaluations while improving agreement on these tests.

Evidence Ref: Contribution (i) / Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pairwise agreement (AUTO-J, overall, Eval-P)	55.0%	various open-source baselines	≈+8.9% relative vs open-source SOTA (paper claim)	Eval-P (58 scenarios)	Table 1 shows 55.0% overall for AUTO-J	Table 1
Pairwise agreement (GPT-4, overall, Eval-P)	62.3%	—	—	Eval-P	Table 1 reports GPT-4 62.3% overall	Table 1
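
The table reports absolute scores next to a relative delta, so it helps to keep the two units apart. A small worked calculation using only the two overall Eval-P scores from the table (the derived gaps are plain arithmetic, not figures quoted from the paper):

```python
# Overall Eval-P pairwise agreement as reported in the Results table (percent).
AUTO_J = 55.0
GPT_4 = 62.3

# Absolute gap in percentage points.
gap_points = round(GPT_4 - AUTO_J, 1)

# The same gap expressed relative to GPT-4's score.
gap_relative = round(100 * (GPT_4 - AUTO_J) / GPT_4, 1)

print(gap_points)    # 7.3
print(gap_relative)  # 11.7
```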

What To Try In 7 Days

Run AUTO-J on your existing model outputs for a small slice of workflows (e.g., emails, customer replies) to compare against your human labels

Use AUTO-J ratings to automate a Best-of-N selection pipeline and compare top selections to your current metric

Inspect AUTO-J critiques on 50 samples to find common failure modes and prioritize quick policy or prompt fixes
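
The Best-of-N step in the list above can be sketched in a few lines. This is only an outline under stated assumptions: `rate` is a hypothetical callable that wraps whatever AUTO-J inference call you set up and returns a numeric rating, and `toy_rate` is a stand-in scorer used so the example runs without a model.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    rate: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest judge rating for this prompt."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    # max() keeps the first candidate on ties, so the choice is deterministic.
    return max(candidates, key=lambda response: rate(prompt, response))


def toy_rate(prompt: str, response: str) -> float:
    """Stand-in scorer (hypothetical): rewards longer answers, capped at 10."""
    return min(10.0, float(len(response.split())))


drafts = ["Thanks.", "Thanks for writing in. Your refund was issued today."]
print(best_of_n("Reply to a refund request", drafts, toy_rate))
```

Passing the scorer in as a parameter keeps the selection logic independent of the judge, which makes it easy to run the same candidates through your current metric and compare the picks.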

Optimization Features

Infra Optimization

trained on 8x NVIDIA A100 GPUs with DeepSpeed

System Optimization

input format design omits scenario criteria (context distillation style) to improve generality

Training Optimization

ZeRO Stage 3 (DeepSpeed)

gradient checkpointing

FlashAttention

mixed BF16/TF32 precision

AdamW optimizer (lr schedule with cosine decay)
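
Part of this setup maps onto a DeepSpeed configuration roughly as follows. This is a sketch, not the authors' actual file: only the techniques (ZeRO Stage 3, BF16, AdamW) come from the text, and every numeric value is a placeholder assumption. Gradient checkpointing, FlashAttention, TF32, and the cosine schedule are usually enabled in the model or training script rather than in this config, so they are left out here.

```python
import json

# Illustrative DeepSpeed config. All numeric values are placeholder
# assumptions, not the settings used to train AUTO-J.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "gradient_accumulation_steps": 8,      # placeholder
    "zero_optimization": {"stage": 3},     # ZeRO Stage 3 sharding
    "bf16": {"enabled": True},             # BF16 mixed precision
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "weight_decay": 0.0},  # placeholders
    },
}

# Serialize to the JSON form DeepSpeed reads from a config file.
print(json.dumps(ds_config, indent=2))
```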

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/GAIR-NLP/auto-j

Data URLs

https://github.com/GAIR-NLP/auto-j

Risks & Boundaries

Limitations

Training labels rely heavily on GPT-4 judgments; that can propagate GPT-4's blind spots and biases.

AUTO-J's response-level agreement still lags the best proprietary judges (e.g., GPT-4 at 62.3% vs. AUTO-J at 55.0% on Eval-P).

When Not To Use

For single-sample, high-stakes safety decisions without human review — AUTO-J should be an aid, not the final arbiter.

When you need evaluation tightly aligned to a niche legal/regulatory standard not represented in the 58 scenarios.

Failure Modes

Inherited judge bias: reproduces GPT-4 preference patterns in training data.

Overconfidence on tasks outside the 58 trained scenarios.

Core Entities

Models

AUTO-J (13B), LLaMA-2-13B-chat, LLaMA-2-chat-70B, GPT-4, ChatGPT (gpt-3.5-turbo), Claude-2, PandaLM, SelFee, SteamSHP, Vicuna-13B, WizardLM-13B

Metrics

agreement rate (pairwise)

win-rate (critique comparisons)

Pearson correlation (ratings)

Spearman correlation (ratings/ranking)

consistency when swapping response order

Best-of-N average GPT-4 rating
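
Of these metrics, order-swap consistency is the least self-explanatory: the judge sees each pair twice, once as (A, B) and once as (B, A), and a consistent judge should prefer the same underlying response both times. A minimal sketch, assuming the hypothetical verdict labels "A", "B", and "tie" refer to display slots (the paper's exact protocol may differ):

```python
from typing import Sequence

# Map a verdict given on (B, A) back into the frame of the original (A, B) order.
_FLIP = {"A": "B", "B": "A", "tie": "tie"}


def swap_consistency(original: Sequence[str], swapped: Sequence[str]) -> float:
    """Share of pairs whose verdict is unchanged after swapping response order.

    original[i] is the verdict with responses shown as (A, B);
    swapped[i] is the verdict for the same pair shown as (B, A).
    """
    if len(original) != len(swapped) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    consistent = sum(o == _FLIP[s] for o, s in zip(original, swapped))
    return consistent / len(original)


# Toy verdicts for illustration only.
first_pass = ["A", "B", "tie", "A"]
second_pass = ["B", "A", "tie", "A"]  # last pair: judge picked slot A both times
print(swap_consistency(first_pass, second_pass))  # 0.75
```

A low score here signals position bias, which is worth checking before trusting pairwise verdicts from any judge.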

Datasets

Chatbot Arena Conversations, MTBench, OpenAI Summary, OpenAI WebGPT, Stanford SHP, Synthetic GPT-J, PKU-SafeRLHF, AlpacaEval

Benchmarks

Eval-P (pairwise, 1,392 test samples)

Eval-C (single-response critiques, 232 samples)

Eval-R (rating/Best-of-N tests, 3,712 pairs per base LLM)

AlpacaEval (system-level)
