Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
Prometheus 2 provides an open, lower-cost evaluator that better matches human and proprietary-LLM judgments and supports custom criteria—useful to automate model QA, reduce evaluation costs, and avoid vendor lock-in.
Summary TLDR
Prometheus 2 is an open-source evaluator language model trained to do both direct scoring (1–5 scale) and pairwise A/B ranking under user-defined criteria. It uses a new pairwise dataset (PREFERENCE COLLECTION, 200k instances) plus an existing direct-feedback dataset, and merges weights from format-specific fine-tuned models. Prometheus 2 (7B & 8x7B) yields the best correlations and human agreement among open evaluator LMs on eight benchmarks, reduces the gap to GPT-4 substantially, and is released with code and models under Apache-2.0 (dataset subject to OpenAI terms).
Problem Statement
Proprietary LMs (e.g., GPT-4) are commonly used as automatic judges, but they are expensive, opaque, and hard to control. Existing open evaluator models either score poorly against humans/GPT-4 or only support one evaluation format (direct scoring or pairwise). The paper asks: can an open LM match proprietary judges across both formats and custom criteria?
Main Contribution
Prometheus 2 (7B & 8x7B): open evaluator LMs that handle both direct assessment and pairwise ranking.
PREFERENCE COLLECTION: a new pairwise ranking dataset with 1,000 instance-wise evaluation criteria and 200k pairwise instances.
A practical recipe: fine-tune separate models on each format and merge weights (DARE-Linear variant) to get a unified evaluator that outperforms joint training and other baselines.
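The merging step in the recipe above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes DARE-Linear means "randomly drop entries of each fine-tuned model's delta from the base, rescale the survivors by 1/(1 - p), then combine the deltas linearly", and it represents each model as a plain dict of flat float lists. The function names, `alpha`, and `drop_p` are hypothetical.

```python
import random


def linear_merge(direct, pairwise, alpha=0.5):
    """Interpolate two models that share parameter names and shapes."""
    return {
        name: [alpha * d + (1.0 - alpha) * p
               for d, p in zip(direct[name], pairwise[name])]
        for name in direct
    }


def dare_linear_merge(base, direct, pairwise, drop_p=0.5, alpha=0.5, seed=0):
    """Drop-and-rescale each task delta, then add a weighted sum to the base.

    Assumption: all three models share parameter names and shapes.
    """
    if not 0.0 <= drop_p < 1.0:
        raise ValueError("drop_p must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_p)  # keeps the expected delta unchanged
    merged = {}
    for name, base_vals in base.items():
        out = list(base_vals)
        for weight, tuned in ((alpha, direct[name]), (1.0 - alpha, pairwise[name])):
            for i, (t, b) in enumerate(zip(tuned, base_vals)):
                if rng.random() >= drop_p:  # keep this delta entry
                    out[i] += weight * scale * (t - b)
        merged[name] = out
    return merged


# Toy weights standing in for real checkpoints.
base = {"w": [0.0, 0.0, 0.0, 0.0]}
direct = {"w": [1.0, 2.0, 3.0, 4.0]}
pairwise = {"w": [3.0, 2.0, 1.0, 0.0]}

linear = linear_merge(direct, pairwise)
dare_no_drop = dare_linear_merge(base, direct, pairwise, drop_p=0.0)
```

With `drop_p=0.0` nothing is dropped, so the DARE variant reduces to plain linear interpolation, which is a quick sanity check before trying larger drop rates.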
Key Findings
Prometheus 2 gives the highest correlation with humans and proprietary LM judges among tested open evaluator LMs.
Prometheus-2-8x7B halves the gap to GPT-4 on human correlation for the FLASK benchmark.
Merging weights from a direct-assessment model and a pairwise model outperforms joint training and simple ensembling.
The PREFERENCE COLLECTION provides 200k pairwise instances covering 1,000 fine-grained, instance-wise evaluation criteria.
Including a reference answer improves evaluator performance.
Prometheus 2 maintains strong consistency and transitivity in judgments.
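One way to check the transitivity claim on your own pairwise judgments is sketched below. It is an illustrative measure, not necessarily the paper's exact definition: for every ordered triple where the judge preferred A over B and B over C, and also compared A with C, it counts how often A was preferred over C. The `wins` set of (winner, loser) pairs and the function name are hypothetical.

```python
from itertools import permutations


def transitivity_rate(wins):
    """Fraction of judged triples (a > b, b > c) where a > c also holds.

    wins: set of (winner, loser) tuples from pairwise judgments.
    Returns None when no triple can be checked.
    """
    items = sorted({x for pair in wins for x in pair})
    checked = consistent = 0
    for a, b, c in permutations(items, 3):
        if (a, b) in wins and (b, c) in wins:
            # Only count triples where a and c were actually compared.
            if (a, c) in wins or (c, a) in wins:
                checked += 1
                consistent += (a, c) in wins
    return consistent / checked if checked else None


ordered = {("A", "B"), ("B", "C"), ("A", "C")}
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

ordered_rate = transitivity_rate(ordered)
cyclic_rate = transitivity_rate(cyclic)
```

A fully ordered set of judgments scores 1.0, while a preference cycle scores 0.0; real evaluator output will usually fall somewhere in between.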
Results
Direct assessment Pearson (vs GPT-4/GPT judge)
Correlation with humans (FLASK)
Accuracy
Training compute
Who Should Care
What To Try In 7 Days
Run Prometheus-2-7B on your current evaluation prompts and compare correlation with human labels.
If you use pairwise A/B tests, fine-tune a small evaluator on your criteria and merge with Prometheus 2 weights.
Add reference answers to your evaluation pipeline to improve score reliability with Prometheus 2.
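For the first step above, the comparison against human labels comes down to a correlation coefficient. The sketch below computes Pearson's r in plain Python; the score lists are made-up placeholders, and in practice `evaluator` would hold the 1–5 scores parsed from the evaluator's output for the same items the humans rated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder 1-5 scores for five responses (not real data).
human = [1, 2, 3, 4, 5]
evaluator = [2, 2, 3, 5, 5]

r = pearson(human, evaluator)
print(f"Pearson r = {r:.3f}")
```

Rerunning the same comparison with and without reference answers in the prompt gives a direct read on how much they help for your data.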
Reproducibility
License
- Apache-2.0
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Supports only 1–5 Likert direct scoring and binary/paired ranking; not multi-item ranking or checklist formats.
- PREFERENCE COLLECTION includes GPT-4 generated feedback and is subject to OpenAI Terms of Use.
- Weight-merging effectiveness is empirical; the paper does not provide a full theoretical explanation.
When Not To Use
- When you need evaluation formats beyond 1–5 scoring or pairwise A/B without further adaptation.
- If licensing restrictions on your evaluation data conflict with the dataset's terms of use.
- If you require the absolute highest agreement with GPT-4 on specific niche tasks; proprietary models may still outperform.
Failure Modes
- Bias toward the training judges' style: the model mimics patterns present in the FEEDBACK COLLECTION and PREFERENCE COLLECTION.
- Performance drops when no reference answer is available (reference-free evaluations are weaker).
- The merged model may inherit contradictory behaviors if the training data for the two formats conflict heavily.
Core Entities
Models
- Prometheus-2-7B
- Prometheus-2-8x7B
- Prometheus-7B
- Prometheus-13B
- Mistral-7B-Instruct
- Mixtral-8x7B-Instruct
- GPT-4-1106
- Claude-3-Opus
- GPT-3.5-Turbo-0613
- Auto-J
- PairRM
- UltraRM
Metrics
- Pearson
- Spearman
- Kendall-Tau
- Accuracy
- Krippendorff's alpha
- Transitivity
Datasets
- PREFERENCE COLLECTION
- FEEDBACK COLLECTION
- Vicuna Bench
- MT Bench
- FLASK
- Feedback Bench
- HHH Alignment
- MTBench Human Judgment
- Auto-J Eval
- Preference Bench
- BiGGen Bench
Benchmarks
- Vicuna Bench
- MT Bench
- FLASK
- Feedback Bench
- HHH Alignment
- MTBench Human Judgment
- Auto-J Eval
- Preference Bench
- BiGGen Bench

