Prometheus 2: an open evaluator LM that handles both direct scoring and pairwise comparisons and narrows the gap to GPT-4

Overview

Decision Snapshot: Ready For Pilot

Solid experimental evidence shows Prometheus 2 improves open-model agreement with humans and proprietary judges across many benchmarks; results are reproducible and code/models are released, though dataset licensing is restricted.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

License: Apache-2.0

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Prometheus 2 provides an open, lower-cost evaluator that better matches human and proprietary-LLM judgments and supports custom criteria. That makes it useful for automating model QA, reducing evaluation costs, and avoiding vendor lock-in.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

Prometheus 2 is an open-source evaluator language model trained to do both direct scoring (1–5 scale) and pairwise A/B ranking under user-defined criteria. It uses a new pairwise dataset (PREFERENCE COLLECTION, 200k instances) plus an existing direct-feedback dataset, and merges weights from format-specific fine-tuned models. Prometheus 2 (7B & 8x7B) yields the best correlations and human agreement among open evaluator LMs on eight benchmarks, reduces the gap to GPT-4 substantially, and is released with code and models under Apache-2.0 (dataset subject to OpenAI terms).
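The weight-merging step mentioned above can be pictured with a minimal sketch. It assumes plain linear interpolation between two checkpoints that share an architecture. The function name `linear_merge`, the mixing weight `alpha`, and the toy parameter dictionaries are illustrative assumptions, not the authors' released code.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def linear_merge(direct: Params, pairwise: Params, alpha: float = 0.5) -> Params:
    """Return alpha * direct + (1 - alpha) * pairwise, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if direct.keys() != pairwise.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged: Params = {}
    for name, d_vals in direct.items():
        p_vals = pairwise[name]
        if len(d_vals) != len(p_vals):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * d + (1.0 - alpha) * p for d, p in zip(d_vals, p_vals)]
    return merged


# Hypothetical checkpoints: one tuned for direct scoring, one for pairwise ranking.
direct_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
pairwise_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged_model = linear_merge(direct_model, pairwise_model, alpha=0.5)
print(merged_model)
```

With real checkpoints the same idea would be applied to tensors rather than Python lists, and the mixing weight would be chosen on a validation set.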

Problem Statement

Proprietary LMs (e.g., GPT-4) are commonly used as automatic judges, but they are expensive, opaque, and hard to control. Existing open evaluator models either score poorly against humans/GPT-4 or only support one evaluation format (direct scoring or pairwise). The paper asks: can an open LM match proprietary judges across both formats and custom criteria?

Main Contribution

Prometheus 2 (7B & 8x7B): open evaluator LMs that handle both direct assessment and pairwise ranking.

PREFERENCE COLLECTION: a new pairwise ranking dataset with 1,000 instance-wise evaluation criteria and 200k pairwise instances.

Key Findings

Prometheus 2 gives the highest correlation with humans and proprietary LM judges among tested open evaluator LMs.

Numbers: Pearson up to 0.685 vs prior open baselines ~0.48 (Vicuna/MT/FLASK averages)

Practical Use: Use Prometheus 2 instead of earlier open evaluators when you need model-based judgments that better match human/GPT-4 scores.

Evidence Ref: Table 3, Table 5

Prometheus-2-8x7B roughly halves the gap to GPT-4 on human correlation for the FLASK benchmark.

Numbers: Human–GPT-4 Pearson 0.679; Prometheus-13B 0.449 -> Prometheus-2-8x7B 0.555

Practical Use: If you currently rely on GPT-4 for quality evaluation, Prometheus-2-8x7B is a viable lower-cost open alternative that closely tracks human judgments on many tasks.

Evidence Ref: Section 5.1, Table 3
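The gap claim can be checked with a few lines of arithmetic on the three FLASK correlations quoted in this finding. The helper name `gap_closed` is an illustrative assumption; only the three input numbers come from the summary.

```python
def gap_closed(old: float, new: float, target: float) -> float:
    """Fraction of the (target - old) gap removed by moving from old to new."""
    gap_before = target - old
    if gap_before <= 0:
        raise ValueError("target must exceed the old score")
    return (new - old) / gap_before


# Pearson correlation with human judgments on FLASK, as quoted above.
human_gpt4 = 0.679
prometheus_13b = 0.449
prometheus_2_8x7b = 0.555

fraction = gap_closed(prometheus_13b, prometheus_2_8x7b, human_gpt4)
print(f"gap before: {human_gpt4 - prometheus_13b:.3f}")
print(f"gap after:  {human_gpt4 - prometheus_2_8x7b:.3f}")
print(f"fraction of gap closed: {fraction:.2f}")
```

The gap shrinks from 0.230 to 0.124, so about 46% of it is closed, which is close to, but slightly short of, a literal halving.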

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Direct assessment Pearson (vs GPT-4/GPT judge)	0.685 (Prometheus-2-8x7B on Vicuna/avg)	Mixtral-8x7B ~0.566	+0.12	Vicuna/MT/FLASK averages	Table 3, Table 5	Table 3
Correlation with humans (FLASK)	0.555 (Prometheus-2-8x7B)	Prometheus-13B 0.449; GPT-4 0.679	+0.106 vs Prometheus-13B (closes about 46% of the gap to GPT-4)	FLASK	Section 5.1, Table 3	Table 3

What To Try In 7 Days

Run Prometheus-2-7B on your current evaluation prompts and compare correlation with human labels.

If you use pairwise A/B tests, fine-tune a small evaluator on your criteria and merge with Prometheus 2 weights.

Add reference answers to your evaluation pipeline to improve score reliability with Prometheus 2.
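For the first step above, comparing evaluator scores with human labels comes down to a correlation over paired scores. The sketch below uses only the standard library. The score lists are made-up placeholders, and how the evaluator scores are obtained from Prometheus-2-7B is deliberately left out.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data: 1-5 scores for the same six responses.
human_scores = [1, 2, 3, 4, 5, 3]
evaluator_scores = [1, 3, 3, 4, 5, 2]

r = pearson(human_scores, evaluator_scores)
print(f"Pearson r between human and evaluator scores: {r:.3f}")
```

Running the same comparison with and without reference answers in the prompt would show how much the third step helps on your own data. On Python 3.10 or later, `statistics.correlation` computes the same quantity.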

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Apache-2.0

Code URLs

https://github.com/prometheus-eval/prometheus-eval

Data URLs

https://github.com/prometheus-eval/prometheus-eval

Risks & Boundaries

Limitations

Supports only 1–5 Likert direct scoring and binary/paired ranking; not multi-item ranking or checklist formats.

PREFERENCE COLLECTION includes GPT-4 generated feedback and is subject to OpenAI Terms of Use.
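The multi-item ranking limitation can sometimes be worked around by composing pairwise verdicts. The sketch below ranks several responses by round-robin win counts. The `judge` callable is a hypothetical stand-in for a call to a pairwise evaluator that returns "A" or "B", and the length-based judge exists only to make the example runnable.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical judge interface: returns "A" if the first response wins, else "B".
Judge = Callable[[str, str], str]


def rank_by_pairwise_wins(responses: Sequence[str], judge: Judge) -> List[str]:
    """Order responses by how many head-to-head comparisons each one wins."""
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = judge(responses[i], responses[j])
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        else:
            raise ValueError(f"unexpected verdict: {verdict!r}")
    # sorted() is stable, so tied responses keep their original order.
    order = sorted(range(len(responses)), key=lambda k: -wins[k])
    return [responses[k] for k in order]


def longer_is_better(a: str, b: str) -> str:
    """Toy judge used only for this demo; a real judge would query the evaluator."""
    return "A" if len(a) >= len(b) else "B"


candidates = ["ok", "a detailed answer", "short reply"]
ranking = rank_by_pairwise_wins(candidates, longer_is_better)
print(ranking)
```

This needs N(N-1)/2 judge calls for N responses, and non-transitive verdicts can produce ties, so it is an adaptation to validate rather than a supported format.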

When Not To Use

When you need evaluation formats beyond 1–5 scoring or pairwise A/B without further adaptation.

If your evaluation data cannot be shared due to licensing that conflicts with the dataset's terms.

Failure Modes

Bias toward training judge styles: model mimics patterns present in FEEDBACK/PREFERENCE collections.

Performance drops when no reference answer is available (reference-free evaluations are weaker).

Core Entities

Models

Prometheus-2-7B, Prometheus-2-8x7B, Prometheus-7B, Prometheus-13B, Mistral-7B-Instruct, Mixtral-8x7B-Instruct, GPT-4-1106, Claude-3-Opus, GPT-3.5-Turbo-0613, Auto-J, PairRM, UltraRM

Metrics

Pearson, Spearman, Kendall-Tau, Accuracy, Krippendorff's alpha, Transitivity

Datasets

PREFERENCE COLLECTION, FEEDBACK COLLECTION, Vicuna Bench, MT Bench, FLASK, Feedback Bench, HHH Alignment, MTBench Human Judgment, Auto-J Eval, Preference Bench, BiGGen Bench

Benchmarks

Vicuna Bench, MT Bench, FLASK, Feedback Bench, HHH Alignment, MTBench Human Judgment, Auto-J Eval, Preference Bench, BiGGen Bench

