REFEED: refine LLM outputs by retrieving documents about the model's own answers

May 23, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.75

Citation Count

10

Authors

Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, Ashish Sabharwal

Links

Abstract / PDF

Why It Matters For Business

You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.

Summary TLDR

REFEED is a plug-and-play pipeline that refines a language model's generated answer by using that answer (or several sampled answers) to retrieve supporting documents, then re-prompting the model with those documents to produce a revised answer. On four knowledge-heavy benchmarks (NQ, TriviaQA, HotpotQA, WoW), REFEED improves accuracy over closed-book and basic retrieve-then-read baselines. Two practical modules, diverse answer generation and an ensemble that chooses between the before and after answers by likelihood, reduce cases where retrieval misleads the model.

Problem Statement

Large language models still hallucinate or give outdated or incomplete facts. Human feedback and fine-tuning help, but they are costly and cannot be applied at inference time. We need an inexpensive, inference-time way to automatically check and improve individual generated outputs using external documents, without fine-tuning.

Main Contribution

Propose REFEED, a plug-and-play retrieval-feedback loop that conditions retrieval on the model's own generated answer to produce targeted supporting documents.

Introduce two practical modules: (1) diverse answer generation (sample multiple answers to widen retrieval coverage) and (2) an ensemble that picks the better answer by comparing log-likelihood before vs after retrieval.

Show consistent accuracy gains on four knowledge-intensive benchmarks (zero- and few-shot), and show REFEED works with chain-of-thought prompting.
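
The ensemble in (2) can be sketched as follows. This is a minimal illustration, not the authors' code: `ScoredAnswer`, `avg_logprob`, and `pick_answer` are hypothetical names, and averaging per-token log-probabilities (length normalization) is an assumption made here so that answers of different lengths can be compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredAnswer:
    """An answer plus the per-token log-probabilities the model assigned to it."""
    text: str
    token_logprobs: List[float]

    @property
    def avg_logprob(self) -> float:
        # Assumption: length-normalized log-likelihood (mean over tokens).
        return sum(self.token_logprobs) / max(len(self.token_logprobs), 1)


def pick_answer(before: ScoredAnswer, after: ScoredAnswer) -> ScoredAnswer:
    """Keep the refined answer only if its likelihood did not drop after retrieval."""
    return after if after.avg_logprob >= before.avg_logprob else before


# Usage: the refined answer is less likely, so the original is kept.
original = ScoredAnswer("Paris", [-0.1, -0.2])
refined = ScoredAnswer("Lyon", [-1.0, -2.0])
print(pick_answer(original, refined).text)
```

The rule needs token log-probabilities from the model, so it only applies where the API exposes them.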

Key Findings

REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.

Numbers: ~+6% overall zero-shot improvement (as reported)

REFEED yields measurable dataset gains versus retrieve-then-read (example: NQ EM 31.7 → 39.6).

Numbers: NQ EM +7.9 (31.7 → 39.6)

Both proposed modules help: removing diverse generation drops EM by ~1.1 and removing the ensemble drops EM by ~0.8 on average.

Numbers: ablation, diverse generation −1.1 EM, ensemble −0.8 EM

Retrieval feedback can sometimes mislead the model and produce worse answers.

Numbers: documented negative examples in the case study

Results

Metric: EM | Value: 39.6 | Baseline: Retrieve-then-Read EM 31.7

Metric: EM | Value: 68.9 | Baseline: Retrieve-then-Read EM 61.4

Metric: EM | Value: 46.4 | Baseline: Retrieve-then-Read EM 43.9

Metric: EM | Value: 44.2 | Baseline: Retrieve-Read with CoT EM 42.1

Who Should Care

What To Try In 7 Days

Implement a simple pipeline: generate an answer → retrieve the top-10 docs with BM25 using [question, answer] as the query → re-prompt the model with the docs and compare before/after likelihoods.

Sample multiple answers (nucleus sampling) to diversify retrieval, then de-duplicate and keep top-k docs.

Add a likelihood-based ensemble: if the answer's likelihood drops after feedback, keep the original answer.
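
The retrieval side of the steps above can be sketched in plain Python. Everything here is illustrative: `TinyBM25`, `retrieve_with_answers`, and `build_refine_prompt` are hypothetical names, the scorer is a simplified in-memory BM25 (a real system would more likely use a search index), and the prompt wording is a placeholder rather than the prompt used in the paper.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Simplified in-memory BM25 scorer (illustrative, not tuned)."""

    def __init__(self, docs: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.docs = list(docs)
        self.tokens = [tokenize(d) for d in self.docs]
        self.tf = [Counter(t) for t in self.tokens]
        total_len = sum(len(t) for t in self.tokens)
        self.avgdl = (total_len / len(self.tokens)) if total_len else 1.0
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(self.docs)
        # Lucene-style IDF variant, which is always positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
        self.k1, self.b = k1, b

    def search(self, query: str, k: int = 10) -> List[Tuple[int, float]]:
        """Return up to k (doc_id, score) pairs with score > 0, best first."""
        terms = tokenize(query)
        scored = []
        for i, tf in enumerate(self.tf):
            length_norm = self.k1 * (1 - self.b + self.b * len(self.tokens[i]) / self.avgdl)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + length_norm)
                for t in terms if tf.get(t, 0) > 0
            )
            if score > 0:
                scored.append((i, score))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]


def retrieve_with_answers(index: TinyBM25, question: str,
                          answers: Sequence[str], k: int = 10) -> List[str]:
    """Query once per sampled answer with [question, answer], de-duplicate, keep top-k."""
    best: Dict[int, float] = {}
    for answer in answers:
        for doc_id, score in index.search(f"{question} {answer}", k):
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    ranked = sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
    return [index.docs[doc_id] for doc_id, _ in ranked[:k]]


def build_refine_prompt(question: str, docs: Sequence[str]) -> str:
    # Placeholder wording; the actual refinement prompt is an assumption.
    passages = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Refer to the passages below and answer the question.\n\n{passages}\n\nQuestion: {question}\nAnswer:"


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Bananas are yellow.",
]
index = TinyBM25(corpus)
docs = retrieve_with_answers(index, "What is the capital of France?", ["Paris", "Paris", "Lyon"], k=2)
print(build_refine_prompt("What is the capital of France?", docs))
```

Keeping the best score per document across the per-answer queries is one simple way to merge results. The sampled answers themselves would come from the model (for example with nucleus sampling), which is left out of this sketch.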

Reproducibility

Data Urls

  • NaturalQuestions, TriviaQA, HotpotQA, Wizard of Wikipedia (public datasets); Wikipedia (public corpus)

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Retrieval feedback can mislead the model if retrieved docs are irrelevant or contain confusing signals (case studies show failures).
  • Experiments use a fixed set of benchmark datasets and specific OpenAI models (text-davinci-003, Codex); gains may vary with other models or corpora.
  • No public code release in the paper, making exact reproduction harder.

When Not To Use

  • When you cannot run retrieval over a relevant, up-to-date corpus at inference time.
  • When strict latency or cost constraints forbid extra retrieval and extra model calls.
  • When model likelihoods are unavailable for ensemble decisions (black-box APIs without log-probabilities).

Failure Modes

  • Retrieved documents introduce misleading facts and cause correct answers to flip to incorrect ones (observed in Figure 5).
  • Lexical overlap introduced by wrong generated answers can surface documents that reinforce the wrong answer.
  • If retriever quality is poor, refinement can add noise rather than correction.

Core Entities

Models

  • text-davinci-003
  • code-davinci-002 (Codex)
  • InstructGPT (baseline references)

Metrics

  • Exact Match (EM)
  • F1
  • Recall@K
  • ROUGE-L
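
For reference, Exact Match and token-level F1 are commonly computed with SQuAD-style answer normalization, as in the sketch below. Whether the paper's evaluation uses exactly this normalization is an assumption; the function names are illustrative.

```python
import re
import string
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: Sequence[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in golds))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))
```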

Datasets

  • NaturalQuestions (NQ)
  • TriviaQA
  • HotpotQA
  • Wizard of Wikipedia (WoW)
  • Wikipedia (corpus)

Benchmarks

  • KILT (splits for HotpotQA and WoW used)

Context Entities

Models

  • DPR, ORQA (related retrieval methods referenced)