AUTO-J: a 13B open-source judge that scores LLM outputs across 58 real-world scenarios and writes critiques

October 9, 2023 · 7 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

5

Authors

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu

Why It Matters For Business

AUTO-J offers a reusable, lower-cost, and reproducible judgment engine for internal model comparisons and automated selection, reducing dependence on expensive closed APIs while giving readable critiques teams can act on.

Summary TLDR

AUTO-J is a 13B parameter open-source evaluator fine-tuned to judge large language model outputs. Trained on model responses and GPT-4 judgments across 58 real-world scenarios, it supports pairwise comparison, single-response critique, and numeric ratings. On the authors' meta-tests it matches or beats many open-source baselines and narrows the gap to proprietary judges on system-level ranking. Code, data, and prompts are released.

Problem Statement

Off-the-shelf automatic metrics and ad-hoc human labels don't scale to diverse, real-world alignment evaluation. Teams need an evaluator that (1) works across many user scenarios without gold references, (2) supports multiple evaluation protocols (pairwise, single-response, scalar rating), and (3) returns readable critiques so humans can inspect and act on judgments.

Main Contribution

AUTO-J: a 13B generative judge trained to produce pairwise decisions, single-response critiques, and scalar ratings with human-readable explanations

A new judgment dataset built from 58 real-world scenarios with 332 curated scenario criteria and mixed GPT-4 / human labels

Open release of model, scenario typology, prompts, and judgments for reproducible evaluation

Key Findings

AUTO-J achieves state-of-the-art agreement among open-source judges on a 58-scenario pairwise benchmark.

Numbers: 8.9% relative improvement (pairwise agreement vs. open-source baselines)

AUTO-J outperforms ChatGPT and Claude-2 on the authors' pairwise test by double-digit margins.

Numbers: +12.1% and +12.4% (pairwise agreement vs. ChatGPT / Claude-2)

AUTO-J correlates reasonably with GPT-4 on Best-of-N selection and strongly with GPT-4 on system-level rankings.

Numbers: Pearson 0.57 / Spearman 0.55 (Best-of-N); Spearman 0.97 (system-level)

The training data comprises 3,436 labeled pairs for pairwise comparison and 960 single-response samples, covering 58 scenarios with a criteria set of 332 items.

Numbers: 3,436 pairwise; 960 single-response; 332 criteria; 58 scenarios

Results

Pairwise agreement (AUTO-J, overall, Eval-P)

Value: 55.0%

Baseline: various open-source baselines

Pairwise agreement (GPT-4, overall, Eval-P)

Value: 62.3%

Best-of-N selection correlation with GPT-4 (AUTO-J)

Value: Pearson 0.57 / Spearman 0.55

Baseline: other rating models

System-level ranking correlation (AlpacaEval)

Value: Spearman 0.97 / Pearson 0.96

Baseline: GPT-4 ranking
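
The correlation figures above compare one judge's scores with a reference judge's scores over the same items. As a minimal sketch of how Pearson and Spearman correlation can be computed, the pure-Python functions below use made-up win rates for four hypothetical systems; the names `judge_scores` and `reference_scores` are illustrative and do not come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-system scores (NOT values from the paper).
judge_scores = [0.81, 0.64, 0.55, 0.30]
reference_scores = [0.90, 0.70, 0.72, 0.25]

print(round(spearman(judge_scores, reference_scores), 3))
print(round(pearson(judge_scores, reference_scores), 3))
```

Spearman only looks at rank order, which is why it suits system-level leaderboards: two judges can assign different absolute scores and still agree on which system is better.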

What To Try In 7 Days

Run AUTO-J on your existing model outputs for a small slice of workflows (e.g., emails, customer replies) to compare against your human labels

Use AUTO-J ratings to automate a Best-of-N selection pipeline and compare top selections to your current metric

Inspect AUTO-J critiques on 50 samples to find common failure modes and prioritize quick policy or prompt fixes
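
A Best-of-N selection step of the kind suggested above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `rate` stands for any function that returns a numeric rating for one response, and `toy_rate` is a hypothetical stand-in that simply counts words so the example runs without a model. In practice `rate` would wrap a call to the judge.

```python
from typing import Callable, Sequence, Tuple


def best_of_n(candidates: Sequence[str],
              rate: Callable[[str], float]) -> Tuple[str, float]:
    """Return the highest-rated candidate and its rating.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(rate(text), idx) for idx, text in enumerate(candidates)]
    # Highest score wins; among equal scores the smallest index wins.
    best_score, best_idx = max(scored, key=lambda t: (t[0], -t[1]))
    return candidates[best_idx], best_score


def toy_rate(text: str) -> float:
    """Hypothetical placeholder rater (word count); replace with a judge call."""
    return float(len(text.split()))


samples = ["Thanks.", "Thanks, we will refund you today.", "Refund issued."]
choice, score = best_of_n(samples, toy_rate)
print(choice, score)
```

Keeping the rater as a plain callable makes it easy to run the same candidates through your current metric and through the judge, then compare which responses each one selects.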

Optimization Features

Infra Optimization

  • trained on 8x NVIDIA A100 GPUs with DeepSpeed

System Optimization

  • input format design omits scenario criteria (context distillation style) to improve generality

Training Optimization

  • ZeRO Stage 3 (DeepSpeed)
  • gradient-checkpointing
  • FlashAttention
  • mixed BF16/TF32 precision
  • AdamW optimizer (lr schedule with cosine decay)
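
The settings listed above map partly onto a DeepSpeed configuration. The sketch below shows what such a config could look like as a Python dict; it is not the authors' file. Batch size, accumulation steps, learning rate, and weight decay are placeholder values chosen for illustration, and gradient checkpointing, FlashAttention, TF32, and the cosine schedule are assumed to be set in the training script rather than here.

```python
import json

# Illustrative DeepSpeed config; numeric values are placeholders, not the paper's.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "gradient_accumulation_steps": 1,      # placeholder
    "bf16": {"enabled": True},             # BF16 mixed precision
    "zero_optimization": {"stage": 3},     # ZeRO Stage 3: shard params, grads, optimizer state
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-5, "weight_decay": 0.0},  # placeholders
    },
}

# DeepSpeed accepts the config as a JSON file or as a dict.
print(json.dumps(ds_config, indent=2))
```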

Reproducibility

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Training labels rely heavily on GPT-4 judgments; that can propagate GPT-4's blind spots and biases.
  • AUTO-J's response-level agreement still lags the best proprietary judges (e.g., GPT-4 at 62.3% vs. AUTO-J at 55.0% on Eval-P).
  • The dataset excludes non-English samples and truncates multi-turn dialogues to the first turn, which limits coverage of multilingual and multi-turn use.
  • Authors filter or discard GPT-4 outputs inconsistent with human labels, which may bias the training distribution.

When Not To Use

  • For single-sample, high-stakes safety decisions without human review. AUTO-J should be an aid, not the final arbiter.
  • When you need evaluation tightly aligned to a niche legal/regulatory standard not represented in the 58 scenarios.

Failure Modes

  • Inherited judge bias: reproduces GPT-4 preference patterns in training data.
  • Overconfidence on tasks outside the 58 trained scenarios.
  • Potential collapse to paraphrasing scenario criteria if trained with criteria in input (authors avoided this).

Core Entities

Models

  • AUTO-J (13B)
  • LLaMA-2-13B-chat
  • LLaMA-2-chat-70B
  • GPT-4
  • ChatGPT (gpt-3.5-turbo)
  • Claude-2
  • PandaLM
  • SelFee
  • SteamSHP
  • Vicuna-13B
  • WizardLM-13B

Metrics

  • agreement rate (pairwise)
  • win-rate (critique comparisons)
  • Pearson correlation (ratings)
  • Spearman correlation (ratings/ranking)
  • consistency when swapping response order
  • Best-of-N average GPT-4 rating
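
Two of the metrics above, pairwise agreement rate and consistency under swapped response order, reduce to simple counting. The sketch below assumes verdicts are encoded as "A", "B", or "tie"; the encoding, the function names, and the sample verdicts are illustrative assumptions, and the paper's exact definitions may differ in detail.

```python
def agreement_rate(predicted, gold):
    """Fraction of items where the judge's verdict matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Maps a verdict given on the swapped pair back to the original ordering.
FLIP = {"A": "B", "B": "A", "tie": "tie"}


def swap_consistency(original_order, swapped_order):
    """Fraction of pairs whose verdict is unchanged after swapping responses.

    `swapped_order` holds verdicts produced when the two responses were shown
    in reverse order, so each is flipped before comparison.
    """
    if len(original_order) != len(swapped_order) or not original_order:
        raise ValueError("inputs must be non-empty and the same length")
    same = sum(o == FLIP[s] for o, s in zip(original_order, swapped_order))
    return same / len(original_order)


# Hypothetical verdicts for four pairs (not data from the paper).
gold = ["A", "B", "tie", "A"]
judge = ["A", "B", "A", "B"]
judge_swapped = ["B", "A", "B", "B"]

print(agreement_rate(judge, gold))
print(swap_consistency(judge, judge_swapped))
```

A judge can score well on agreement while still favoring whichever response appears first, so the two numbers are worth reporting together.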

Datasets

  • Chatbot Arena Conversations
  • MTBench
  • OpenAI Summary
  • OpenAI WebGPT
  • Stanford SHP
  • Synthetic GPT-J
  • PKU-SafeRLHF
  • AlpacaEval

Benchmarks

  • Eval-P (pairwise, 1,392 test samples)
  • Eval-C (single-response critiques, 232 samples)
  • Eval-R (rating/Best-of-N tests, 3,712 pairs per base LLM)
  • AlpacaEval (system-level)