Overview
Solid experimental evidence shows Prometheus 2 improves open-model agreement with humans and proprietary judges across many benchmarks; results are reproducible and code/models are released, though dataset licensing is restricted.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
License: Apache-2.0
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Prometheus 2 provides an open, lower-cost evaluator that better matches human and proprietary-LLM judgments and supports custom criteria. This makes it useful for automating model QA, reducing evaluation costs, and avoiding vendor lock-in.
Summary TLDR
Prometheus 2 is an open-source evaluator language model trained to do both direct scoring (1–5 scale) and pairwise A/B ranking under user-defined criteria. It uses a new pairwise dataset (PREFERENCE COLLECTION, 200k instances) plus an existing direct-feedback dataset, and merges weights from format-specific fine-tuned models. Prometheus 2 (7B & 8x7B) yields the best correlations and human agreement among open evaluator LMs on eight benchmarks, reduces the gap to GPT-4 substantially, and is released with code and models under Apache-2.0 (dataset subject to OpenAI terms).
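The weight-merging step can be illustrated with a minimal sketch. It assumes simple linear interpolation between two checkpoints of the same architecture; the summary does not state the exact merging rule, so `linear_merge`, `alpha`, and the toy checkpoints below are hypothetical and stand in for real model tensors.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def linear_merge(direct: Weights, pairwise: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate two checkpoints: alpha * direct + (1 - alpha) * pairwise.

    Assumption: both checkpoints were fine-tuned from the same base model,
    so parameter names and shapes line up one to one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if direct.keys() != pairwise.keys():
        raise ValueError("checkpoints must have identical parameter names")

    merged: Weights = {}
    for name, d_vals in direct.items():
        p_vals = pairwise[name]
        if len(d_vals) != len(p_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * d + (1.0 - alpha) * p for d, p in zip(d_vals, p_vals)]
    return merged


# Hypothetical checkpoints: one tuned for direct scoring, one for pairwise ranking.
direct_ckpt = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
pairwise_ckpt = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged_ckpt = linear_merge(direct_ckpt, pairwise_ckpt, alpha=0.5)
print(merged_ckpt)
```

With real models the same idea would apply per tensor rather than per Python list, and `alpha` would be a hyperparameter to tune on held-out evaluation data.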
Problem Statement
Proprietary LMs (e.g., GPT-4) are commonly used as automatic judges, but they are expensive, opaque, and hard to control. Existing open evaluator models either score poorly against humans/GPT-4 or only support one evaluation format (direct scoring or pairwise). The paper asks: can an open LM match proprietary judges across both formats and custom criteria?
Main Contribution
Prometheus 2 (7B & 8x7B): open evaluator LMs that handle both direct assessment and pairwise ranking.
PREFERENCE COLLECTION: a new pairwise ranking dataset with 1,000 instance-wise evaluation criteria and 200k pairwise instances.
Key Findings
Prometheus 2 gives the highest correlation with humans and proprietary LM judges among tested open evaluator LMs.
Prometheus-2-8x7B roughly halves the gap to GPT-4 on human correlation for the FLASK benchmark (0.555 vs. 0.449 for Prometheus-13B; GPT-4 scores 0.679).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Direct assessment Pearson (vs GPT-4/GPT judge) | 0.685 (Prometheus-2-8x7B on Vicuna/avg) | Mixtral-8x7B ~0.566 | +0.12 | Vicuna/MT/FLASK averages | Table 3, Table 5 | Table 3 |
| Correlation with humans (FLASK) | 0.555 (Prometheus-2-8x7B) | Prometheus-13B 0.449; GPT-4 0.679 | +0.106 vs Prometheus-13B; gap to GPT-4 narrows from 0.230 to 0.124 (roughly halved) | FLASK | Section 5.1, Table 3 | Table 3 |
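The "halves the gap" claim in the FLASK row can be checked directly from the three reported correlations. The sketch below uses only the numbers in the table; the helper name `gap_closed` is illustrative.

```python
def gap_closed(new: float, old: float, target: float) -> float:
    """Fraction of the old-to-target gap that the new score closes."""
    if target == old:
        raise ValueError("old and target scores must differ")
    return (new - old) / (target - old)


# FLASK human-correlation values as reported in the results table.
prometheus_13b = 0.449
prometheus_2_8x7b = 0.555
gpt4 = 0.679

improvement = prometheus_2_8x7b - prometheus_13b  # gain over the earlier open model
remaining_gap = gpt4 - prometheus_2_8x7b          # distance still left to GPT-4
fraction = gap_closed(prometheus_2_8x7b, prometheus_13b, gpt4)

print(f"improvement:   {improvement:.3f}")
print(f"remaining gap: {remaining_gap:.3f}")
print(f"gap closed:    {fraction:.1%}")
```

On these figures the new model closes a little under half of the previous gap, so "halves" is an approximation rather than an exact ratio.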
What To Try In 7 Days
Run Prometheus-2-7B on your current evaluation prompts and compare correlation with human labels.
If you use pairwise A/B tests, fine-tune a small evaluator on your criteria and merge with Prometheus 2 weights.
Add reference answers to your evaluation pipeline to improve score reliability with Prometheus 2.
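A first-week comparison against human labels might look like the following minimal sketch. It assumes you already have evaluator scores and human scores on the same items; the sample data and the helper names `pearson` and `pairwise_agreement` are hypothetical, and no Prometheus 2 API is called here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def pairwise_agreement(model_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Share of A/B comparisons where the evaluator picks the same side as humans."""
    if len(model_choices) != len(human_choices) or not model_choices:
        raise ValueError("need two non-empty choice lists of equal length")
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(model_choices)


# Hypothetical 1-5 direct-assessment scores for five responses.
human_scores = [5, 3, 4, 1, 2]
evaluator_scores = [4, 3, 5, 1, 2]
r = pearson(evaluator_scores, human_scores)

# Hypothetical A/B verdicts for four response pairs.
agreement = pairwise_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"])

print(f"Pearson r: {r:.2f}, pairwise agreement: {agreement:.2f}")
```

Running the same check with and without reference answers on an identical item set is one way to see how much references change the scores on your own data.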
Risks & Boundaries
Limitations
Supports only 1–5 Likert direct scoring and binary/paired ranking; not multi-item ranking or checklist formats.
PREFERENCE COLLECTION includes GPT-4 generated feedback and is subject to OpenAI Terms of Use.
When Not To Use
When you need evaluation formats beyond 1–5 scoring or pairwise A/B without further adaptation.
If your evaluation data cannot be shared due to licensing that conflicts with the dataset's terms.
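One possible form of "further adaptation" for multi-item ranking is to reduce it to repeated A/B comparisons and count wins. This is an illustrative sketch, not something the paper is stated to do; `rank_by_pairwise_wins` and the toy `longer_is_better` judge are hypothetical, and a real setup would replace the judge with calls to a pairwise evaluator.

```python
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(responses: List[str], judge: Callable[[str, str], str]) -> List[str]:
    """Rank responses by how many round-robin A/B comparisons each one wins.

    `judge(a, b)` must return "A" if the first response is better, else "B".
    Ties in win count are broken by original position.
    """
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = judge(responses[i], responses[j])
        if verdict == "A":
            wins[i] += 1
        elif verdict == "B":
            wins[j] += 1
        else:
            raise ValueError(f"unexpected verdict: {verdict!r}")
    order = sorted(range(len(responses)), key=lambda k: (-wins[k], k))
    return [responses[k] for k in order]


def longer_is_better(a: str, b: str) -> str:
    """Toy stand-in judge for demonstration only: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


ranking = rank_by_pairwise_wins(["bb", "a", "cccc"], longer_is_better)
print(ranking)
```

The number of comparisons grows quadratically with the number of responses, so this approach is practical only for small candidate sets.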
Failure Modes
Bias toward training judge styles: the model may mimic patterns present in the FEEDBACK COLLECTION and PREFERENCE COLLECTION training data.
Performance drops when no reference answer is available (reference-free evaluations are weaker).

