AUTO-J: a 13B open-source judge that scores LLM outputs across 58 real-world scenarios and writes critiques

October 9, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

AUTO-J is ready for evaluation pipelines and internal selection tasks; expect remaining gaps vs GPT-4 at sample-level and plan a human check for high-stakes outputs.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AUTO-J offers a reusable, lower-cost, and reproducible judgment engine for internal model comparisons and automated selection, reducing dependence on expensive closed APIs while giving readable critiques teams can act on.

Who Should Care

Summary TLDR

AUTO-J is a 13B-parameter open-source evaluator fine-tuned to judge large language model outputs. Trained on model responses and GPT-4 judgments across 58 real-world scenarios, it supports pairwise comparison, single-response critique, and numeric ratings. On the authors' meta-tests it matches or beats many open-source baselines and narrows the gap to proprietary judges on system-level ranking. Code, data, and prompts are released.

Problem Statement

Off-the-shelf automatic metrics and ad-hoc human labels don't scale to diverse, real-world alignment evaluation. Teams need an evaluator that (1) works across many user scenarios without gold references, (2) supports multiple evaluation protocols (pairwise, single-response, scalar rating), and (3) returns readable critiques so humans can inspect and act on judgments.

Main Contribution

AUTO-J: a 13B generative judge trained to produce pairwise decisions, single-response critiques, and scalar ratings with human-readable explanations

A new judgment dataset built from 58 real-world scenarios with 332 curated scenario criteria and mixed GPT-4 / human labels

Key Findings

AUTO-J achieves state-of-the-art agreement among open-source judges on a 58-scenario pairwise benchmark.

Numbers: 8.9% relative improvement (pairwise vs open-source baselines)

Practical Use: Use AUTO-J as a drop-in open-source judge when you need broad-scenario pairwise comparisons instead of calling closed APIs.

Evidence Ref: Intro / §6.1

AUTO-J outperforms ChatGPT and Claude-2 on the authors' pairwise test by double-digit margins.

Numbers: +12.1% and +12.4% (pairwise agreement vs ChatGPT/Claude-2)

Practical Use: For internal A/B work where API cost or reproducibility is a concern, AUTO-J can replace some closed-model evaluations while improving agreement on these tests.

Evidence Ref: Contribution (i) / Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pairwise agreement (AUTO-J, overall, Eval-P) | 55.0% | various open-source baselines | ≈+8.9% vs open-source SOTA (paper claim) | Eval-P (58 scenarios) | Table 1 shows 55.0% overall for AUTO-J | Table 1
Pairwise agreement (GPT-4, overall, Eval-P) | 62.3% | - | - | Eval-P | Table 1 reports GPT-4 62.3% overall | Table 1
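The pairwise agreement figures above are, in essence, the share of test pairs where the judge's verdict matches the reference label. A minimal sketch of that computation follows. The label vocabulary ("A", "B", "tie") and the sample data are assumptions made for illustration; they are not the paper's exact format or scoring script.

```python
def agreement_rate(judge_labels, reference_labels):
    """Fraction of samples where the judge's verdict equals the reference verdict."""
    if len(judge_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one sample")
    matches = sum(j == r for j, r in zip(judge_labels, reference_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts: "A" = first response wins, "B" = second wins, "tie" = no preference.
human = ["A", "B", "tie", "A", "B"]
judge = ["A", "B", "A", "A", "tie"]

print(f"agreement: {agreement_rate(judge, human):.1%}")
```

Running the same function on your own human-labeled slice gives a number you can set beside the Eval-P figures, keeping in mind that your scenarios and label scheme may differ from the benchmark's.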

What To Try In 7 Days

Run AUTO-J on your existing model outputs for a small slice of workflows (e.g., emails, customer replies) to compare against your human labels

Use AUTO-J ratings to automate a Best-of-N selection pipeline and compare top selections to your current metric

Inspect AUTO-J critiques on 50 samples to find common failure modes and prioritize quick policy or prompt fixes
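The Best-of-N step in the list above can be sketched as follows: score each candidate response with a rating function and keep the highest-scoring one. The `rate` callable is a hypothetical stand-in for a call to the judge model; `toy_rate` is a placeholder so the sketch runs on its own, and it is not how AUTO-J scores responses.

```python
from typing import Callable, Sequence


def best_of_n(query: str, candidates: Sequence[str],
              rate: Callable[[str, str], float]) -> str:
    """Return the highest-rated candidate; on equal ratings the earliest one wins."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda response: rate(query, response))


def toy_rate(query: str, response: str) -> float:
    # Placeholder scorer (NOT AUTO-J): one point per word, capped at 10.
    return min(10.0, float(len(response.split())))


query = "Draft a reply to a customer asking about a refund."
candidates = [
    "Sure.",
    "Here is a short draft reply.",
    "Thanks for reaching out, here is the full refund reply you asked for.",
]

print(best_of_n(query, candidates, toy_rate))
```

Swapping `toy_rate` for a function that prompts the judge and parses its numeric rating is the only change needed to try this on real outputs. Comparing the selected responses against your current metric then shows whether the judge's ratings add signal.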

Optimization Features

Infra Optimization: trained on 8x NVIDIA A100 GPUs with DeepSpeed
System Optimization: input format design omits scenario criteria (context distillation style) to improve generality
Training Optimization: ZeRO Stage 3 (DeepSpeed), gradient checkpointing, FlashAttention, mixed BF16/TF32 precision, AdamW optimizer (lr schedule with cosine decay)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Training labels rely heavily on GPT-4 judgments; that can propagate GPT-4's blind spots and biases.

AUTO-J's response-level agreement still lags the best proprietary judges (e.g., GPT-4: 62.3% vs AUTO-J: 55.0% on Eval-P).

When Not To Use

For single-sample, high-stakes safety decisions without human review — AUTO-J should be an aid, not the final arbiter.

When you need evaluation tightly aligned to a niche legal/regulatory standard not represented in the 58 scenarios.

Failure Modes

Inherited judge bias: reproduces GPT-4 preference patterns in training data.

Overconfidence on tasks outside the 58 trained scenarios.

Core Entities

Models

AUTO-J (13B), LLaMA-2-13B-chat, LLaMA-2-chat-70B, GPT-4, ChatGPT (gpt-3.5-turbo), Claude-2, PandaLM, SelFee, SteamSHP, Vicuna-13B, WizardLM-13B

Metrics

agreement rate (pairwise), win-rate (critique comparisons), Pearson correlation (ratings), Spearman correlation (ratings/ranking), consistency when swapping response order, Best-of-N average GPT-4 rating
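Consistency when swapping response order checks whether a pairwise judge names the same winner when the two responses are presented in reverse. One way to measure it is sketched below. The verdict strings ("first", "second", "tie") and the toy judge are assumptions for illustration, not AUTO-J's output format.

```python
from typing import Callable, Sequence, Tuple

# Verdicts are relative to argument order, so a swapped verdict must be flipped
# before it can be compared with the original one.
FLIP = {"first": "second", "second": "first", "tie": "tie"}


def swap_consistency(pairs: Sequence[Tuple[str, str]],
                     judge: Callable[[str, str], str]) -> float:
    """Share of pairs whose verdict names the same winner after swapping the order."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for a, b in pairs:
        original = judge(a, b)
        swapped = judge(b, a)
        if FLIP[swapped] == original:
            consistent += 1
    return consistent / len(pairs)


def position_biased_judge(first: str, second: str) -> str:
    # Toy judge (NOT AUTO-J): prefers the longer response, and favors the
    # first slot when lengths are equal, which is a position bias.
    return "first" if len(first) >= len(second) else "second"


pairs = [("short", "a much longer answer"), ("same", "size")]
print(swap_consistency(pairs, position_biased_judge))
```

A judge with no position bias scores 1.0 here. The toy judge is consistent on the first pair but not on the equal-length pair, where it picks whichever response comes first.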

Datasets

Chatbot Arena Conversations, MTBench, OpenAI Summary, OpenAI WebGPT, Stanford SHP, Synthetic GPT-J, PKU-SafeRLHF, AlpacaEval

Benchmarks

Eval-P (pairwise, 1,392 test samples), Eval-C (single-response critiques, 232 samples), Eval-R (rating/Best-of-N tests, 3,712 pairs per base LLM), AlpacaEval (system-level)
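For rating-style evaluation, the Pearson and Spearman correlations listed under Metrics compare the judge's scores with reference scores. The dependency-free sketch below shows both; the sample scores are invented for illustration and are not results from the paper. In practice a library routine such as SciPy's would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear association between two score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for four responses (illustrative only).
judge_scores = [1, 2, 3, 4]
reference_scores = [1, 4, 9, 16]

print(f"pearson={pearson(judge_scores, reference_scores):.3f}")
print(f"spearman={spearman(judge_scores, reference_scores):.3f}")
```

Spearman only looks at ordering, so it fits the question "does the judge rank responses the same way?" Pearson also penalizes differences in how the scores are spaced.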