Prometheus 2: an open evaluator LM that handles both scoring and pairwise comparisons and closes the gap to GPT-4

May 2, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo

Links

Abstract / PDF

Why It Matters For Business

Prometheus 2 provides an open, lower-cost evaluator that matches human and proprietary-LLM judgments more closely than earlier open evaluators and supports custom criteria. It can be used to automate model QA, reduce evaluation costs, and avoid vendor lock-in.

Summary TLDR

Prometheus 2 is an open-source evaluator language model trained to do both direct scoring (1–5 scale) and pairwise A/B ranking under user-defined criteria. It uses a new pairwise dataset (PREFERENCE COLLECTION, 200k instances) plus an existing direct-feedback dataset, and merges weights from format-specific fine-tuned models. Prometheus 2 (7B & 8x7B) yields the best correlations and human agreement among open evaluator LMs on eight benchmarks, reduces the gap to GPT-4 substantially, and is released with code and models under Apache-2.0 (dataset subject to OpenAI terms).
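To make the two evaluation formats concrete, here is a minimal sketch of the inputs and outputs each one involves. The class names, field names, and validation helpers are hypothetical and chosen for illustration only; they are not the data schema or API of the released Prometheus 2 code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectAssessment:
    """One response is graded on a 1-5 scale against a user-defined criterion."""
    instruction: str
    response: str
    criteria: str
    reference_answer: Optional[str] = None  # optional, per the summary


@dataclass
class PairwiseRanking:
    """Two responses are compared under the same user-defined criterion."""
    instruction: str
    response_a: str
    response_b: str
    criteria: str
    reference_answer: Optional[str] = None


def validate_score(score: int) -> int:
    """Accept only integer scores on the 1-5 scale used for direct assessment."""
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return score


def validate_choice(choice: str) -> str:
    """Accept only 'A' or 'B' as a pairwise verdict."""
    if choice not in ("A", "B"):
        raise ValueError(f"choice must be 'A' or 'B', got {choice!r}")
    return choice


# Hypothetical example inputs for each format.
direct = DirectAssessment(
    instruction="Explain recursion.",
    response="Recursion is when a function calls itself.",
    criteria="Is the explanation accurate and clear?",
)
pairwise = PairwiseRanking(
    instruction="Explain recursion.",
    response_a="Recursion is when a function calls itself.",
    response_b="Recursion is a kind of loop.",
    criteria="Is the explanation accurate and clear?",
)
```

Both formats share the instruction, the criterion, and the optional reference answer; they differ only in how many responses are judged and in the shape of the verdict.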

Problem Statement

Proprietary LMs (e.g., GPT-4) are commonly used as automatic judges, but they are expensive, opaque, and hard to control. Existing open evaluator models either score poorly against humans/GPT-4 or only support one evaluation format (direct scoring or pairwise). The paper asks: can an open LM match proprietary judges across both formats and custom criteria?

Main Contribution

Prometheus 2 (7B & 8x7B): open evaluator LMs that handle both direct assessment and pairwise ranking.

PREFERENCE COLLECTION: a new pairwise ranking dataset with 1,000 instance-wise evaluation criteria and 200k pairwise instances.

A practical recipe: fine-tune separate models on each format and merge weights (DARE-Linear variant) to get a unified evaluator that outperforms joint training and other baselines.
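The merging step in the recipe above can be sketched as follows. This is a minimal illustration of the DARE idea as it is usually described (drop a random fraction of each fine-tuned model's parameter deltas, rescale the survivors, then combine the deltas linearly on top of the base weights), not the authors' implementation. The function names, the flat-list parameter layout, the drop rate, and the merge weights are all assumptions made for readability.

```python
import random


def dare(delta, drop_rate, rng):
    """Zero each delta entry with probability drop_rate; rescale the rest by 1/(1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [d * scale if rng.random() >= drop_rate else 0.0 for d in delta]


def dare_linear_merge(base, finetuned, weights, drop_rate, seed=0):
    """Merge fine-tuned models into the base: base + sum_i w_i * DARE(finetuned_i - base)."""
    rng = random.Random(seed)
    merged = {}
    for name, base_param in base.items():
        update = [0.0] * len(base_param)
        for model, w in zip(finetuned, weights):
            delta = [ft - b for ft, b in zip(model[name], base_param)]
            sparse = dare(delta, drop_rate, rng)
            update = [u + w * s for u, s in zip(update, sparse)]
        merged[name] = [b + u for b, u in zip(base_param, update)]
    return merged


# Toy parameters (hypothetical values): one "layer" with two weights.
base = {"w": [1.0, 1.0]}
direct_model = {"w": [1.2, 0.8]}    # stands in for the direct-assessment model
pairwise_model = {"w": [0.9, 1.4]}  # stands in for the pairwise-ranking model

# drop_rate=0.0 disables dropping, so this reduces to a plain linear merge.
merged = dare_linear_merge(base, [direct_model, pairwise_model],
                           weights=[0.5, 0.5], drop_rate=0.0)
print(merged)
```

With a non-zero drop rate the merged values become random, but the rescaling keeps each delta's expected value unchanged, which is the property DARE relies on.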

Key Findings

Prometheus 2 gives the highest correlation with humans and proprietary LM judges among tested open evaluator LMs.

Numbers: Pearson up to 0.685 vs prior open baselines ~0.48 (Vicuna/MT/FLASK averages)

Prometheus-2-8x7B roughly halves the gap to GPT-4 on human correlation for the FLASK benchmark.

Numbers: Human–GPT-4 Pearson 0.679; Prometheus-13B 0.449 -> Prometheus-2-8x7B 0.555
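The size of the gap reduction follows directly from the three correlations quoted above. The short calculation below uses only those figures; the variable names are illustrative.

```python
# Pearson correlation with human judgments on FLASK, as quoted in the summary.
human_gpt4 = 0.679
human_prometheus_13b = 0.449
human_prometheus2_8x7b = 0.555

gap_before = human_gpt4 - human_prometheus_13b    # gap for the earlier open model
gap_after = human_gpt4 - human_prometheus2_8x7b   # gap for Prometheus-2-8x7B
gap_closed = (gap_before - gap_after) / gap_before

print(f"gap before: {gap_before:.3f}, after: {gap_after:.3f}, closed: {gap_closed:.1%}")
# prints: gap before: 0.230, after: 0.124, closed: 46.1%
```

So the gap shrinks from 0.230 to 0.124, a reduction of about 46%.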

Merging weights from a direct-assessment model and a pairwise model outperforms joint training and simple ensembling.

Numbers: Weight merging avg Pearson 0.624 vs joint training 0.485 (direct benchmarks); pairwise avg accuracy 79.15 vs lower for the other approaches

The PREFERENCE COLLECTION provides fine-grained pairwise criteria and scale.

Numbers: 1,000 evaluation criteria; PREFERENCE COLLECTION size 200k instances; FEEDBACK COLLECTION 100k instances

Including a reference answer improves evaluator performance.

Numbers: Reference-based Pearson uplift up to +0.144 for Prometheus-2 (FLASK)

Prometheus 2 maintains strong consistency and transitivity in judgments.

Numbers: Transitivity: Prometheus-2-7B 97.6, Prometheus-2-8x7B 96.75; Krippendorff's alpha higher for larger model
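The summary does not spell out how the transitivity score is computed. The sketch below shows one common definition, as an assumption: among all ordered triples where the judge prefers a over b and b over c, count the fraction where it also prefers a over c. The helper names and the toy judgments are hypothetical.

```python
from itertools import permutations


def build_preferences(judgments):
    """Turn (winner, loser) pairs into a lookup: prefers[(x, y)] is True if x beat y."""
    prefers = {}
    for winner, loser in judgments:
        prefers[(winner, loser)] = True
        prefers[(loser, winner)] = False
    return prefers


def transitivity_rate(prefers):
    """Fraction of chains a > b > c (with a vs c also judged) where a > c holds."""
    items = sorted({x for pair in prefers for x in pair})
    checked = 0
    satisfied = 0
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and (a, c) in prefers:
            checked += 1
            satisfied += int(prefers[(a, c)])
    return satisfied / checked if checked else float("nan")


# Toy judgments over three responses.
consistent = build_preferences([("A", "B"), ("B", "C"), ("A", "C")])
cyclic = build_preferences([("A", "B"), ("B", "C"), ("C", "A")])

print(transitivity_rate(consistent))  # 1.0
print(transitivity_rate(cyclic))      # 0.0
```

Multiplying the rate by 100 gives a percentage on the same kind of scale as the figures quoted above, though the paper's exact procedure may differ.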

Results

Direct assessment Pearson (vs GPT-4/GPT judge)

Value: 0.685 (Prometheus-2-8x7B on Vicuna/avg)

Baseline: Mixtral-8x7B ~0.566

Correlation with humans (FLASK)

Value: 0.555 (Prometheus-2-8x7B)

Baseline: Prometheus-13B 0.449; GPT-4 0.679

Accuracy

Value: 79.15 (Prometheus-2-8x7B average across pairwise sets)

Baseline: Mixtral prompting ~74.56; proprietary GPT-4 ~90.95

Training compute

Value: ≈800 GPU hours (8x A100 40GB)

Who Should Care

What To Try In 7 Days

Run Prometheus-2-7B on your current evaluation prompts and compare correlation with human labels.

If you use pairwise A/B tests, fine-tune a small evaluator on your criteria and merge with Prometheus 2 weights.

Add reference answers to your evaluation pipeline to improve score reliability with Prometheus 2.
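For the first suggestion above (comparing evaluator scores with human labels), the comparison itself needs nothing more than a Pearson correlation over paired scores. The sketch below is a plain-Python version under the assumption that you already have one evaluator score and one human score per response; the score lists are made-up placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Placeholder 1-5 scores for five responses (hypothetical data).
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [2, 1, 3, 5, 4]

r = pearson(human_scores, evaluator_scores)
print(f"Pearson r = {r:.2f}")  # prints: Pearson r = 0.80
```

Running the same comparison with and without reference answers in the evaluation prompt gives a simple way to check whether references help on your own data.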

Reproducibility

License

  • Apache-2.0

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Supports only 1–5 Likert direct scoring and binary/paired ranking; not multi-item ranking or checklist formats.
  • PREFERENCE COLLECTION includes GPT-4 generated feedback and is subject to OpenAI Terms of Use.
  • Weight-merging effectiveness is empirical; the paper does not provide a full theoretical explanation.

When Not To Use

  • When you need evaluation formats beyond 1–5 scoring or pairwise A/B without further adaptation.
  • If your evaluation data cannot be shared due to licensing that conflicts with the dataset's terms.
  • If you require the absolute highest agreement with GPT-4 on specific niche tasks; proprietary models may still outperform.

Failure Modes

  • Bias toward training judge styles: model mimics patterns present in FEEDBACK/PREFERENCE collections.
  • Performance drops when no reference answer is available (reference-free evaluations are weaker).
  • Merged model may inherit contradictory behaviors if training data for formats conflicts heavily.

Core Entities

Models

  • Prometheus-2-7B
  • Prometheus-2-8x7B
  • Prometheus-7B
  • Prometheus-13B
  • Mistral-7B-Instruct
  • Mixtral-8x7B-Instruct
  • GPT-4-1106
  • Claude-3-Opus
  • GPT-3.5-Turbo-0613
  • Auto-J
  • PairRM
  • UltraRM

Metrics

  • Pearson
  • Spearman
  • Kendall-Tau
  • Accuracy
  • Krippendorff's alpha
  • Transitivity

Datasets

  • PREFERENCE COLLECTION
  • FEEDBACK COLLECTION
  • Vicuna Bench
  • MT Bench
  • FLASK
  • Feedback Bench
  • HHH Alignment
  • MTBench Human Judgment
  • Auto-J Eval
  • Preference Bench
  • BiGGen Bench

Benchmarks

  • Vicuna Bench
  • MT Bench
  • FLASK
  • Feedback Bench
  • HHH Alignment
  • MTBench Human Judgment
  • Auto-J Eval
  • Preference Bench
  • BiGGen Bench