Pick and fuse the best outputs from many open LLMs using pairwise ranking plus a small fusion model

Overview

Decision Snapshot: Ready For Pilot

The method is practical: ranking plus fusion improves both automatic and GPT-based metrics on MixInstruct. Full pairwise scoring is compute-heavy, so deploy with bubble-run selection or parallelized comparisons.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Dongfu Jiang, Xiang Ren, Bill Yuchen Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Combining multiple open LLMs by ranking and fusing their outputs produces more reliable and higher-quality answers than any single open model on a mixed instruction benchmark.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

LLM-BLENDER ensembles multiple open-source LLMs in two steps. PAIRRANKER compares every pair of candidate responses with a cross-attention encoder to pick the best candidates; GENFUSER (a seq2seq model) then fuses the top K candidates into a final answer. The authors release MixInstruct (110K examples) for training and evaluation. On MixInstruct, PAIRRANKER correlates better with ChatGPT-based ranking (GPT-Rank) than the compared rankers, and LLM-BLENDER improves GPT-Rank (lower is better) from 3.90 for the best single LLM to 3.01 and raises top-3 frequency to 68.6%. Use bubble-run or parallelized comparisons to reduce cost.

Problem Statement

Open-source LLMs have complementary strengths but no single model wins on every input. Existing rankers score candidates independently and miss subtle differences between high-quality outputs. We need a practical ensembling method that picks the best model per input and can fuse multiple good candidates into an even better response.

Main Contribution

A rank-and-fuse system (LLM-BLENDER) composed of PAIRRANKER (pairwise cross-attention ranker) and GENFUSER (seq2seq fusion) that ensembles outputs from many LLMs.

MixInstruct: a new ensemble benchmark with 110K examples, oracle pairwise labels via ChatGPT, and a 5K test split for automatic evaluation.
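The two stages can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `compare` and `fuse` are hypothetical callables standing in for a trained pairwise ranker and a seq2seq fuser, and counting pairwise wins is only one simple way to turn comparisons into a ranking.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical interfaces (assumptions, not the released LLM-Blender API):
#   compare(instruction, a, b) -> True if candidate a is judged better than b
#   fuse(instruction, top_candidates) -> final fused response
Compare = Callable[[str, str, str], bool]
Fuse = Callable[[str, List[str]], str]


def rank_by_wins(instruction: str, candidates: List[str], compare: Compare) -> List[str]:
    """Rank candidates by how many pairwise comparisons each one wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if compare(instruction, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(range(len(candidates)), key=lambda idx: wins[idx], reverse=True)
    return [candidates[idx] for idx in order]


def rank_and_fuse(instruction: str, candidates: List[str],
                  compare: Compare, fuse: Fuse, k: int = 3) -> str:
    """Stage 1: rank all candidates. Stage 2: fuse the top k into one answer."""
    ranked = rank_by_wins(instruction, candidates, compare)
    return fuse(instruction, ranked[:k])


# Toy stand-ins so the sketch runs: the longer answer "wins", fusion joins text.
def toy_compare(instruction: str, a: str, b: str) -> bool:
    return len(a) > len(b)


def toy_fuse(instruction: str, top: List[str]) -> str:
    return " | ".join(top)


candidates = ["ok", "a longer answer", "medium one", "x"]
result = rank_and_fuse("Explain X", candidates, toy_compare, toy_fuse, k=3)
print(result)
```

In a real setup, `toy_compare` would be replaced by a call to the trained ranker and `toy_fuse` by a generation call to the fusion model; the control flow stays the same.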

Key Findings

PAIRRANKER correlates best with ChatGPT-based ranking (GPT-Rank).

Numbers: Pearson correlation 46.98 (PAIRRANKER) vs 41.13 (SummaReranker)

Practical Use: Train a pairwise cross-encoder rather than pointwise scorers when candidates are close in quality; expect noticeably better alignment with human-like judgments.

Evidence Ref: Table 3
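To run the same kind of check on your own ranker, you can correlate its predicted ranks with a reference judge's ranks. The sketch below is a plain-Python Pearson correlation on made-up ranks; the data are illustrative only, and multiplying by 100 to match the magnitude of the numbers above is an assumption about how they were scaled.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up ranks for five candidates on one input (1 = best).
ranker_ranks = [1, 2, 3, 4, 5]  # ranks predicted by your ranker
judge_ranks = [2, 1, 3, 5, 4]   # ranks from the reference judge

r = pearson(ranker_ranks, judge_ranks)
print(f"Pearson r = {r:.2f} (x100 = {100 * r:.1f})")
```

In practice you would compute this per input and average over the evaluation set, or pool all ranks; which aggregation the reported numbers use is not stated in this summary.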

LLM-BLENDER outperforms the best individual open LLM on MixInstruct by multiple metrics.

Numbers: GPT-Rank 3.01 (LLM-BLENDER) vs 3.90 (best LLM, OpenAssistant)

Practical Use: If you can run several open LLMs, rank then fuse their top outputs to get measurably better responses than picking a single model.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-Rank (lower is better)	3.01	OpenAssistant 3.90	-0.89	MixInstruct (test)	Table 2 reports GPT-Rank for LLM-BLENDER and single LLMs	Table 2
BERTScore (higher is better)	79.09	OpenAssistant 74.68	+4.41	MixInstruct (test)	Table 2 shows BERTScore for methods	Table 2

What To Try In 7 Days

Collect 5–10 candidate outputs from different open LLMs for your input set, then train a pairwise cross-encoder (DeBERTa) on a small labeled sample to pick winners.

Implement bubble-run pairwise selection (N-1 comparisons) to get near-best ranking with linear cost and parallelize comparisons when possible.

Fine-tune a small seq2seq model (e.g., Flan-T5-XL, about 3B parameters) to fuse the top-3 ranked candidates into a single output, then compare the result against your default model.
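The bubble-run step above can be sketched as a single pass that keeps a running winner. This is a minimal sketch, assuming a hypothetical `compare(instruction, a, b)` callable that returns `True` when `a` is judged better than `b`; it is not the authors' code.

```python
from typing import Callable, List, Tuple

# Hypothetical comparator interface: True if candidate a beats candidate b.
Compare = Callable[[str, str, str], bool]


def bubble_select(instruction: str, candidates: List[str],
                  compare: Compare) -> Tuple[str, int]:
    """Single pass: compare the running winner with each next candidate.

    Returns the selected candidate and the number of comparisons made,
    which is always len(candidates) - 1.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    comparisons = 0
    for challenger in candidates[1:]:
        comparisons += 1
        if compare(instruction, challenger, best):
            best = challenger
    return best, comparisons


# Toy comparator so the sketch runs: the longer answer "wins".
def toy_compare(instruction: str, a: str, b: str) -> bool:
    return len(a) > len(b)


winner, n_comparisons = bubble_select(
    "Explain X", ["ok", "a longer answer", "medium one", "x"], toy_compare
)
print(winner, n_comparisons)
```

A single pass yields one winner. To get three candidates for fusion, one option (an assumption, not something this summary specifies) is to repeat the pass on the remaining candidates, which still grows linearly in the number of candidates for fixed K.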

Optimization Features

Inference Optimization

bubble-run reduces pairwise comparisons to O(N)
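To make the saving concrete, the sketch below counts comparisons for both strategies. It assumes each unordered pair is scored once; if the ranker scores both orderings of a pair, the full-pairwise count doubles. The candidate counts are arbitrary examples.

```python
def full_pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n candidates: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def bubble_run_comparisons(n: int) -> int:
    """A single bubble-run pass compares the running winner n - 1 times."""
    return max(n - 1, 0)


# Map candidate count -> (full pairwise, bubble-run) comparison counts.
counts = {n: (full_pairwise_comparisons(n), bubble_run_comparisons(n))
          for n in (5, 11, 20)}

for n, (full, bubble) in counts.items():
    print(f"N={n:>2}: full pairwise={full:>3}, bubble-run={bubble:>2}")
```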

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://yuchenlin.xyz/LLM-Blender/

Data URLs

https://yuchenlin.xyz/LLM-Blender/

Risks & Boundaries

Limitations

PAIRRANKER full inference is O(N^2) comparisons which can be expensive for many models.
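Because the pairwise comparisons are independent of each other, they can run concurrently. The sketch below uses a thread pool from the Python standard library and a hypothetical `compare` callable; it is an illustration, not the authors' inference code. Threads mainly help when each comparison is a remote call; for a ranker running on a local GPU, batching pairs into one forward pass is usually the more natural route.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical comparator interface: True if candidate a beats candidate b.
Compare = Callable[[str, str, str], bool]


def parallel_win_counts(instruction: str, candidates: List[str],
                        compare: Compare, max_workers: int = 8) -> List[int]:
    """Score every unordered pair concurrently and count wins per candidate."""
    pairs: List[Tuple[int, int]] = list(combinations(range(len(candidates)), 2))

    def run(pair: Tuple[int, int]) -> bool:
        i, j = pair
        return compare(instruction, candidates[i], candidates[j])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outcomes line up with pairs.
        outcomes = list(pool.map(run, pairs))

    wins = [0] * len(candidates)
    for (i, j), first_wins in zip(pairs, outcomes):
        wins[i if first_wins else j] += 1
    return wins


# Toy comparator so the sketch runs: the longer answer "wins".
def toy_compare(instruction: str, a: str, b: str) -> bool:
    return len(a) > len(b)


wins = parallel_win_counts(
    "Explain X", ["ok", "a longer answer", "medium one", "x"], toy_compare
)
print(wins)
```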

Evaluation relies heavily on ChatGPT-based GPT-Rank instead of large-scale human judgments.

When Not To Use

When you cannot afford to run multiple LLMs in parallel or lack the compute for pairwise comparisons.

When human evaluation is required for safety-critical outputs and automated judges are insufficient.

Failure Modes

Ranking model misorders near-tied candidates, causing suboptimal fusion inputs.

GENFUSER can carry incorrect facts that appear across the top candidates into the fused output.

Core Entities

Models

Vicuna, OpenAssistant, Alpaca, MOSS, Baize, ChatGLM, Koala, Dolly V2, MPT, StableLM, Flan-T5, DeBERTa (backbone for PAIRRANKER), Flan-T5-XL (GENFUSER)

Metrics

GPT-Rank, BERTScore, BARTScore, BLEURT, BLEU, ROUGE, CIDEr, Spearman, Pearson

Datasets

MixInstruct, CNN/DailyMail, CommonGen, WMT18 (zh-en)

Benchmarks

MixInstruct

