Pick and fuse the best outputs from many open LLMs using pairwise ranking plus a small fusion model

June 5, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical: ranking plus fusion improves both automatic and GPT-based metrics on MixInstruct, but full pairwise scoring needs O(N^2) comparisons. To deploy, use bubble-run selection or parallelize the comparisons.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Dongfu Jiang, Xiang Ren, Bill Yuchen Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Combining multiple open LLMs by ranking and fusing their outputs produces more reliable and higher-quality answers than any single open model on a mixed instruction benchmark.

Who Should Care

Summary TLDR

LLM-BLENDER ensembles multiple open-source LLMs in two steps: PAIRRANKER compares every pair of candidate responses with a cross-attention encoder to pick the best candidates; GENFUSER (a seq2seq model) then fuses the top K candidates into a final answer. The authors release MixInstruct (110K examples) for training and evaluation. On MixInstruct, PAIRRANKER correlates best with ChatGPT-based ranking (GPT-Rank), and LLM-BLENDER improves GPT-Rank (lower is better) from 3.90 for the best single LLM to 3.01 and raises its top-3 frequency to 68.6%. Use bubble-run or parallelized comparisons to reduce cost.

Problem Statement

Open-source LLMs have complementary strengths but no single model wins on every input. Existing rankers score candidates independently and miss subtle differences between high-quality outputs. We need a practical ensembling method that picks the best model per input and can fuse multiple good candidates into an even better response.

Main Contribution

A rank-and-fuse system (LLM-BLENDER) composed of PAIRRANKER (pairwise cross-attention ranker) and GENFUSER (seq2seq fusion) that ensembles outputs from many LLMs.

MixInstruct: a new ensemble benchmark with 110K examples, oracle pairwise labels via ChatGPT, and a 5K test split for automatic evaluation.

Key Findings

PAIRRANKER correlates best with ChatGPT-based ranking (GPT-Rank).

Numbers: Pearson correlation 46.98 (PAIRRANKER) vs 41.13 (SummaReranker)

Practical Use: Train a pairwise cross-encoder rather than a pointwise scorer when candidates are close in quality; on MixInstruct this bought roughly 6 points of Pearson correlation with ChatGPT-based rankings.

Evidence Ref: Table 3
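To make the pairwise idea concrete, here is a minimal sketch of full pairwise ranking by win counting. It is an illustration under stated assumptions, not the authors' implementation: `rank_by_pairwise_wins` and `toy_compare` are hypothetical names, win counting is only one simple way to aggregate pairwise results, and the length-based comparator merely stands in for a trained cross-encoder.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def rank_by_pairwise_wins(
    candidates: Sequence[str],
    compare: Callable[[str, str], float],
) -> List[int]:
    """Return candidate indices ordered best-first by number of pairwise wins.

    `compare(a, b)` is a hypothetical scorer: > 0 means `a` is preferred,
    < 0 means `b` is preferred, and 0 is a tie (no win awarded).
    """
    wins = [0] * len(candidates)
    # Full comparison: every unordered pair is scored once, so O(N^2) calls.
    for i, j in combinations(range(len(candidates)), 2):
        score = compare(candidates[i], candidates[j])
        if score > 0:
            wins[i] += 1
        elif score < 0:
            wins[j] += 1
    # sorted() is stable, so candidates with equal wins keep their input order.
    return sorted(range(len(candidates)), key=lambda k: -wins[k])


def toy_compare(a: str, b: str) -> float:
    """Stand-in for a trained cross-encoder: simply prefers the longer answer."""
    return float(len(a) - len(b))


if __name__ == "__main__":
    outputs = ["Paris.", "The capital of France is Paris.", "It is Paris, France."]
    order = rank_by_pairwise_wins(outputs, toy_compare)
    print([outputs[k] for k in order])
```

In practice `compare` would wrap a model that encodes the input together with both candidates, which is what lets it pick up small differences that independent pointwise scores can miss.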

LLM-BLENDER outperforms the best individual open LLM on MixInstruct by multiple metrics.

Numbers: GPT-Rank 3.01 (LLM-BLENDER) vs 3.90 (best single LLM, OpenAssistant)

Practical Use: If you can run several open LLMs, rank and then fuse their top outputs to get measurably better responses than picking a single model.

Evidence Ref: Table 2
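For reading numbers such as a GPT-Rank of 3.01 or a top-3 frequency, the sketch below shows how such summaries could be computed from per-example ranks. It assumes that GPT-Rank is the mean rank a system receives across test inputs (1 = best); the function names and the sample ranks are made up for illustration and are not data from the paper.

```python
from typing import Sequence


def mean_rank(ranks: Sequence[int]) -> float:
    """Average per-example rank of one system (1 = best, lower is better)."""
    if not ranks:
        raise ValueError("need at least one rank")
    return sum(ranks) / len(ranks)


def top_k_frequency(ranks: Sequence[int], k: int = 3) -> float:
    """Fraction of examples on which the system is ranked within the top k."""
    if not ranks:
        raise ValueError("need at least one rank")
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    # Made-up ranks for one system on five inputs, not results from the paper.
    example_ranks = [1, 2, 5, 3, 4]
    print(mean_rank(example_ranks))
    print(top_k_frequency(example_ranks, k=3))
```

Under this reading, a drop in mean rank and a rise in top-3 frequency both point the same way: the ensemble's answer is preferred over more of the individual models, more often.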

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPT-Rank (lower is better) | 3.01 | OpenAssistant 3.90 | -0.89 | MixInstruct (test) | Table 2 reports GPT-Rank for LLM-BLENDER and single LLMs | Table 2
BERTScore (higher is better) | 79.09 | OpenAssistant 74.68 | +4.41 | MixInstruct (test) | Table 2 shows BERTScore for methods | Table 2

What To Try In 7 Days

Collect 5–10 candidate outputs from different open LLMs for your input set, then train a pairwise cross-encoder (DeBERTa) on a small labeled sample to pick winners.

Implement bubble-run pairwise selection (N-1 comparisons) to pick a near-best candidate at linear cost, and parallelize comparisons when possible.

Fine-tune a small seq2seq model (Flan-T5-XL, about 3B parameters) to fuse the top-3 ranked candidates into a single output, then compare the result to your default model.
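The fusion step in the list above needs the instruction and the top-ranked candidates packed into one source sequence for the seq2seq model. The following is a minimal sketch of that packing. The `Instruction:` and `Candidate i:` labels, the newline separator, and the name `build_fusion_input` are all assumptions for illustration; the source does not specify the input format GENFUSER uses.

```python
from typing import List, Sequence


def build_fusion_input(
    instruction: str,
    ranked_candidates: Sequence[str],
    top_k: int = 3,
    sep: str = "\n",  # assumed separator; adapt to your tokenizer and template
) -> str:
    """Concatenate the instruction with the top-k candidates (best first)."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    parts: List[str] = [f"Instruction: {instruction}"]
    # Slicing keeps only the first top_k entries of the already ranked list.
    for idx, cand in enumerate(ranked_candidates[:top_k], start=1):
        parts.append(f"Candidate {idx}: {cand}")
    return sep.join(parts)


if __name__ == "__main__":
    ranked = ["Answer A", "Answer B", "Answer C", "Answer D"]
    print(build_fusion_input("Name the capital of France.", ranked))
```

Whatever template you settle on, use the same one at fine-tuning and inference time, and check that the packed sequence fits within the model's input length.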

Optimization Features

Inference Optimization
Bubble-run reduces pairwise comparisons from O(N^2) to O(N)
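A minimal sketch of bubble-run selection, assuming it works as a single pass that keeps a running winner and compares it against each remaining candidate. The names `bubble_run_best` and `length_compare` are hypothetical, and the comparator is a toy stand-in for a trained pairwise ranker.

```python
from typing import Callable, Sequence, Tuple


def bubble_run_best(
    candidates: Sequence[str],
    compare: Callable[[str, str], float],
) -> Tuple[int, int]:
    """Single pass that keeps the current winner.

    Returns (index_of_best, number_of_comparisons). `compare(a, b) > 0`
    means `a` is preferred over `b`.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best = 0
    n_comparisons = 0
    for challenger in range(1, len(candidates)):
        n_comparisons += 1
        # The challenger replaces the incumbent only on a strict win.
        if compare(candidates[challenger], candidates[best]) > 0:
            best = challenger
    return best, n_comparisons


def length_compare(a: str, b: str) -> float:
    """Toy comparator standing in for a pairwise ranker: longer text wins."""
    return float(len(a) - len(b))


if __name__ == "__main__":
    outputs = ["aa", "a", "aaaa", "aaa"]
    best_index, used = bubble_run_best(outputs, length_compare)
    print(best_index, used)
```

The pass uses exactly N-1 comparisons, but with a noisy or non-transitive comparator the winner can depend on candidate order, which is why the result is near-best rather than guaranteed best.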

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Full PAIRRANKER inference requires O(N^2) pairwise comparisons, which becomes expensive as the number of candidate models grows.
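As a rough worked count, assuming each unordered pair is scored once (double the full-comparison figure if both orderings of a pair are scored), the comparison budgets for N candidates are shown below. N = 11 is used as the example because eleven candidate LLMs are listed under Models, from Vicuna through Flan-T5.

```latex
\[
C_{\text{full}}(N) = \binom{N}{2} = \frac{N(N-1)}{2},
\qquad
C_{\text{bubble}}(N) = N - 1 .
\]
\[
N = 11:\quad
C_{\text{full}} = \frac{11 \cdot 10}{2} = 55,
\qquad
C_{\text{bubble}} = 10 .
\]
```

The gap widens quadratically, so the single-pass shortcut matters most when many models contribute candidates.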

Evaluation relies heavily on ChatGPT-based GPT-Rank instead of large-scale human judgments.

When Not To Use

When you cannot afford to run multiple LLMs in parallel or lack the compute for pairwise comparisons.

When human evaluation is required for safety-critical outputs and automated judges are insufficient.

Failure Modes

Ranking model misorders near-tied candidates, causing suboptimal fusion inputs.

GENFUSER could carry incorrect facts that appear in the top candidates into the fused output.

Core Entities

Models

Vicuna, OpenAssistant, Alpaca, MOSS, Baize, ChatGLM, Koala, Dolly V2, MPT, StableLM, Flan-T5, DeBERTa (backbone for PAIRRANKER), Flan-T5-XL (GENFUSER)

Metrics

GPT-Rank, BERTScore, BARTScore, BLEURT, BLEU, ROUGE, CIDEr, Spearman, Pearson

Datasets

MixInstruct, CNN/DailyMail, CommonGen, WMT18 (zh-en)

Benchmarks

MixInstruct