Pick and fuse the best outputs from many open LLMs using pairwise ranking plus a small fusion model

June 5, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

9

Authors

Dongfu Jiang, Xiang Ren, Bill Yuchen Lin

Why It Matters For Business

Combining multiple open LLMs by ranking and fusing their outputs produces more reliable and higher-quality answers than any single open model on a mixed instruction benchmark.

Summary TLDR

LLM-BLENDER ensembles multiple open-source LLMs in two steps. PAIRRANKER compares every pair of candidate responses with a cross-attention encoder to select the best candidates; GENFUSER (a seq2seq model) then fuses the top K candidates into a final answer. The authors release MixInstruct (110K examples) for training and evaluation. On MixInstruct, PAIRRANKER achieves the highest correlation with ChatGPT-based ranking (GPT-Rank) among the compared rerankers, and LLM-BLENDER improves average GPT-Rank from 3.90 (best single LLM) to 3.01 (lower is better) while raising top-3 frequency to 68.59%. To reduce cost, use bubble-run or parallelize the pairwise comparisons.
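
The ranking step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rank_by_wins` and `toy_prefer` are hypothetical names, each unordered pair is compared once, and results are aggregated by counting wins, which is only one possible aggregation rule. The toy comparator simply prefers longer text and stands in for a trained pairwise model.

```python
from itertools import combinations
from typing import Callable, List


def rank_by_wins(candidates: List[str],
                 prefer: Callable[[str, str], bool]) -> List[int]:
    """Return candidate indices ordered from most to fewest pairwise wins."""
    wins = [0] * len(candidates)
    # Full pairwise comparison: N * (N - 1) / 2 calls to the comparator.
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so ties keep their original order.
    return sorted(range(len(candidates)), key=lambda k: -wins[k])


def toy_prefer(a: str, b: str) -> bool:
    """Hypothetical stand-in for a trained comparator: prefers longer text."""
    return len(a) >= len(b)


candidates = ["ok", "a longer answer", "medium one"]
order = rank_by_wins(candidates, toy_prefer)
top_k = [candidates[i] for i in order[:2]]  # candidates passed on to fusion
print(order, top_k)
```

In practice `prefer` would wrap a cross-encoder that reads the instruction together with both candidates; the selection logic around it stays the same.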

Problem Statement

Open-source LLMs have complementary strengths but no single model wins on every input. Existing rankers score candidates independently and miss subtle differences between high-quality outputs. We need a practical ensembling method that picks the best model per input and can fuse multiple good candidates into an even better response.

Main Contribution

A rank-and-fuse system (LLM-BLENDER) composed of PAIRRANKER (pairwise cross-attention ranker) and GENFUSER (seq2seq fusion) that ensembles outputs from many LLMs.

MixInstruct: a new ensemble benchmark with 110K examples, oracle pairwise labels via ChatGPT, and a 5K test split for automatic evaluation.

Extensive experiments showing PAIRRANKER bests prior rerankers and that fusion of top-k candidates gives further gains over best single LLMs.

Key Findings

PAIRRANKER correlates best with ChatGPT-based ranking (GPT-Rank).

Numbers: Pearson correlation 46.98 (PAIRRANKER) vs 41.13 (SummaReranker)

LLM-BLENDER outperforms the best individual open LLM on MixInstruct by multiple metrics.

Numbers: GPT-Rank 3.01 (LLM-BLENDER) vs 3.90 (best single LLM, OpenAssistant); lower is better

Generative fusion lifts automatic quality scores beyond top candidates.

Numbers: BERTScore 79.09 (LLM-BLENDER) vs 74.68 (OpenAssistant), a gain of +4.41

Pairwise ranking can be costly but can be sped up with bubble-run.

Numbers: full pairwise ranking costs O(N^2) comparisons; bubble-run needs only N-1 comparisons, i.e. O(N), with good empirical performance
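
A single bubble-run pass can be sketched as below, assuming it means keeping a running winner and comparing it against each remaining candidate once. The function name `bubble_run` and the length-based comparator are hypothetical; a real setup would call the trained pairwise ranker instead.

```python
from typing import Callable, List, Tuple


def bubble_run(candidates: List[str],
               prefer: Callable[[str, str], bool]) -> Tuple[int, int]:
    """One pass over the candidates, keeping the current winner.

    Returns (index of the final winner, number of comparisons made).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best = 0
    comparisons = 0
    for i in range(1, len(candidates)):
        comparisons += 1
        # The challenger replaces the winner only if the comparator prefers it.
        if prefer(candidates[i], candidates[best]):
            best = i
    return best, comparisons


def toy_prefer(a: str, b: str) -> bool:
    """Hypothetical stand-in for a trained comparator: prefers longer text."""
    return len(a) > len(b)


candidates = ["short", "a much longer reply", "mid-length", "tiny"]
winner, used = bubble_run(candidates, toy_prefer)
n = len(candidates)
full_pairwise = n * (n - 1) // 2  # unordered pairs in the full comparison
print(winner, used, full_pairwise)  # prints: 1 3 6
```

Note that one pass yields only a single winner. Selecting several candidates for fusion this way would need additional passes or a fallback to full pairwise ranking.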

Results

GPT-Rank (lower is better)

Value: 3.01

Baseline: OpenAssistant 3.90

BERTScore (higher is better)

Value: 79.09

Baseline: OpenAssistant 74.68

Top-3 frequency

Value: 68.59%

Baseline: Vicuna 52.88%

PAIRRANKER Pearson correlation with GPT-Rank

Value: 46.98

Baseline: SummaReranker 41.13
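
To make the rank-based numbers concrete, the sketch below computes an average rank and a top-3 frequency from per-example ranks. It assumes GPT-Rank can be read as the mean rank a system's output receives among all candidates, and top-3 frequency as the share of examples where that rank is 3 or better. The rank values are made up for illustration, and the helper names are hypothetical.

```python
from typing import List


def mean_rank(ranks: List[int]) -> float:
    """Average rank across examples (lower is better)."""
    return sum(ranks) / len(ranks)


def top_k_frequency(ranks: List[int], k: int = 3) -> float:
    """Share of examples where the system's rank is k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Illustrative ranks for one system on five examples (1 = best candidate).
ranks = [1, 4, 2, 3, 7]
avg = mean_rank(ranks)
top3 = top_k_frequency(ranks)
print(f"mean rank: {avg:.2f}, top-3 frequency: {top3:.1%}")
# prints: mean rank: 3.40, top-3 frequency: 60.0%
```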

Who Should Care

What To Try In 7 Days

Collect 5–10 candidate outputs from different open LLMs for your input set, then train a pairwise cross-encoder (DeBERTa) on a small labeled sample to pick winners.

Implement bubble-run pairwise selection (N-1 comparisons) to get a near-best pick at linear cost; if you run full pairwise ranking instead, parallelize the comparisons.

Fine-tune a small seq2seq model (such as Flan-T5-XL, about 3B parameters) to fuse the top-3 ranked candidates into a single output, then compare the result against your default model.
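
For the fusion step, the seq2seq model needs one source string that carries the instruction and the selected candidates. The sketch below shows one way to build it. The `Instruction:` / `Candidate i:` layout and the name `build_fusion_input` are assumptions made for illustration, not the format used by the authors; whichever format is chosen should be identical at fine-tuning and inference time.

```python
from typing import List


def build_fusion_input(instruction: str, ranked: List[str], k: int = 3) -> str:
    """Join the instruction and the top-k ranked candidates into one string."""
    top = ranked[:k]  # candidates are assumed to be sorted best-first
    parts = [f"Instruction: {instruction}"]
    parts += [f"Candidate {i}: {text}" for i, text in enumerate(top, start=1)]
    return "\n".join(parts)


ranked = [
    "Paris is the capital of France.",
    "The capital is Paris.",
    "Paris.",
    "I am not sure.",
]
fusion_input = build_fusion_input("What is the capital of France?", ranked)
print(fusion_input)
```

The resulting string would be tokenized and passed to the fine-tuned fusion model; the lowest-ranked candidate is dropped before fusion.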

Optimization Features

Inference Optimization

  • bubble-run reduces pairwise comparisons to O(N)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Full PAIRRANKER inference needs O(N^2) comparisons, which becomes expensive as the number of candidate models grows.
  • Evaluation relies heavily on ChatGPT-based GPT-Rank instead of large-scale human judgments.
  • GENFUSER may hallucinate or combine errors from candidates; fusion quality depends on fusion model training data.

When Not To Use

  • When you cannot afford running multiple LLMs in parallel or lack compute for pairwise comparisons.
  • When human evaluation is required for safety-critical outputs and automated judges are insufficient.
  • If your use case requires strict provenance of a single model's output rather than fused text.

Failure Modes

  • Ranking model misorders near-tied candidates, causing suboptimal fusion inputs.
  • GENFUSER can carry over incorrect facts that appear in several of the top candidates.
  • Domain shift: PAIRRANKER trained on MixInstruct may not transfer to very different task distributions.

Core Entities

Models

  • Vicuna
  • OpenAssistant
  • Alpaca
  • MOSS
  • Baize
  • ChatGLM
  • Koala
  • Dolly V2
  • MPT
  • StableLM
  • Flan-T5
  • DeBERTa (backbone for PAIRRANKER)
  • Flan-T5-XL (GENFUSER)

Metrics

  • GPT-Rank
  • BERTScore
  • BARTScore
  • BLEURT
  • BLEU
  • ROUGE
  • CIDEr
  • Spearman
  • Pearson

Datasets

  • MixInstruct
  • CNN/DailyMail
  • CommonGen
  • WMT18 (zh-en)

Benchmarks

  • MixInstruct