REFEED: refine LLM outputs by retrieving documents about the model's own answers

May 23, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

REFEED is practical for production pilots: it needs only a retriever and prompts and raises accuracy on several benchmarks, but it requires safeguards because retrieval can sometimes hurt answers.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 50%

Authors

Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, Ashish Sabharwal

Links

Abstract / PDF / Data

Why It Matters For Business

You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.

Who Should Care

Summary TLDR

REFEED is a plug-and-play pipeline that refines a language model's generated answer by using that answer (or many sampled answers) to retrieve supporting documents, then re-prompting the model with those documents to produce a revised answer. On four knowledge-heavy benchmarks (NQ, TriviaQA, HotpotQA, WoW), REFEED improves accuracy over closed-book and basic retrieve-then-read baselines. Two practical modules (diverse answer generation, and an ensemble that chooses between the before-retrieval and after-retrieval answers by likelihood) reduce cases where retrieval misleads the model.
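The loop summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `refeed`, `generate`, and `retrieve` are hypothetical names, the prompt wording is invented for the example, and the model and retriever are passed in as plain callables so any backend can be substituted.

```python
from typing import Callable, Dict, List


def refeed(
    question: str,
    generate: Callable[[str], str],             # hypothetical: prompt -> model text
    retrieve: Callable[[str, int], List[str]],  # hypothetical: (query, k) -> documents
    top_k: int = 10,
) -> Dict[str, object]:
    """One round of retrieval feedback (illustrative sketch only)."""
    # Step 1: closed-book draft answer.
    draft = generate(f"Question: {question}\nAnswer:")

    # Step 2: retrieve with the question AND the draft answer as the query.
    docs = retrieve(f"{question} {draft}", top_k)

    # Step 3: re-prompt with the retrieved documents to get a revised answer.
    # The prompt template below is an assumption, not the paper's template.
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nDraft answer: {draft}\n"
        "Refined answer:"
    )
    revised = generate(prompt)
    return {"draft": draft, "docs": docs, "revised": revised}
```

Keeping the draft, the documents, and the revised answer together in the return value makes it easy to log cases where the revision differs from the draft, which is where retrieval either helped or hurt.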

Problem Statement

LLMs still hallucinate or give outdated or incomplete facts. Human feedback and fine-tuning help but are costly and cannot be applied at inference time. We need an inexpensive, inference-time way to automatically check and improve individual generated outputs using external documents, without fine-tuning.

Main Contribution

Propose REFEED, a plug-and-play retrieval-feedback loop that conditions retrieval on the model's own generated answer to produce targeted supporting documents.

Introduce two practical modules: (1) diverse answer generation (sample multiple answers to widen retrieval coverage) and (2) an ensemble that picks the better answer by comparing log-likelihood before vs after retrieval.
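The likelihood-based ensemble can be illustrated with a small selection function. This is a sketch under assumptions: it supposes the model API exposes per-token log-probabilities for each answer, and it length-normalizes them by taking the mean, which is a choice made for this example rather than something the summary specifies. `mean_logprob` and `pick_answer` are hypothetical names.

```python
from typing import List, Tuple


def mean_logprob(token_logprobs: List[float]) -> float:
    """Average per-token log-probability; an empty answer scores -inf."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def pick_answer(
    before: Tuple[str, List[float]],
    after: Tuple[str, List[float]],
) -> str:
    """Keep the post-retrieval answer unless its likelihood dropped.

    Each argument is (answer_text, token_logprobs). Mean normalization
    is an assumption of this sketch.
    """
    before_text, before_lp = before
    after_text, after_lp = after
    if mean_logprob(after_lp) >= mean_logprob(before_lp):
        return after_text
    return before_text
```

For example, `pick_answer(("Sydney", [-1.2, -0.9]), ("Canberra", [-0.2, -0.1]))` selects the post-retrieval answer because its mean log-probability is higher.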

Key Findings

REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.

Numbers: about +6% overall zero-shot improvement (as reported)

Practical Use: Add retrieval feedback at inference to raise factual accuracy without fine-tuning the model.

Evidence Ref: Abstract, Sec 4.3.1, Table 1

REFEED yields measurable per-dataset gains over retrieve-then-read (example: NQ EM 31.7 → 39.6).

Numbers: NQ EM +7.9 (31.7 → 39.6)

Practical Use: On questions like NQ, conditioning retrieval on model answers finds better supporting docs and can substantially raise exact-match scores.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| EM | 39.6 | Retrieve-then-Read EM 31.7 | +7.9 | NQ zero-shot (TD-003 backbone) | Table 1 (zero-shot) | Table 1 |
| EM | 68.9 | Retrieve-then-Read EM 61.4 | +7.5 | TriviaQA zero-shot (TD-003 backbone) | Table 1 (zero-shot) | Table 1 |

What To Try In 7 Days

Implement a simple pipeline: generate answer → retrieve top-10 docs with BM25 using [question,answer] → re-prompt model with docs and compare before/after likelihoods.

Sample multiple answers (nucleus sampling) to diversify retrieval, then de-duplicate and keep top-k docs.

Add a likelihood-based ensemble: if the answer's likelihood drops after feedback, keep the original answer.
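The retrieval steps above (BM25 over [question, answer] queries, several sampled answers, de-duplication, top-k) can be prototyped without external dependencies. The sketch below is illustrative only: `TinyBM25` is a hypothetical, simplified Okapi BM25 written for this example (a production pilot would more likely use an existing search library), and merging by each document's best score across queries is an assumption, not the paper's stated procedure.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Simplified Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.docs, self.k1, self.b = docs, k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lens) / max(len(docs), 1)
        df = Counter(t for tf in self.tfs for t in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query: str, i: int) -> float:
        tf, dl, total = self.tfs[i], self.lens[i], 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                norm = f + self.k1 * (1 - self.b + self.b * dl / self.avg_len)
                total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[Tuple[float, int]]:
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored = [(s, i) for s, i in scored if s > 0]  # drop non-matching docs
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:k]


def retrieve_with_answers(
    index: TinyBM25, question: str, answers: List[str], k: int = 10
) -> List[str]:
    """Query once per sampled answer, de-duplicate, keep the k best docs."""
    best: Dict[int, float] = {}
    for answer in answers:
        for s, i in index.top_k(f"{question} {answer}", k):
            best[i] = max(best.get(i, 0.0), s)  # assumption: keep the best score
    ranked = sorted(best, key=lambda i: (-best[i], i))
    return [index.docs[i] for i in ranked[:k]]
```

Duplicate sampled answers are harmless here because each document is stored once, keyed by its index, so the merged list never repeats a document.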

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: No
License: Unknown

Data URLs

NaturalQuestions, TriviaQA, HotpotQA, Wizard of Wikipedia (public datasets); Wikipedia (public corpus)

Risks & Boundaries

Limitations

Retrieval feedback can mislead the model if retrieved docs are irrelevant or contain confusing signals (case studies show failures).

Experiments use a fixed set of benchmark datasets and specific OpenAI models (TD-003, Codex); gains may vary with other models or corpora.

When Not To Use

When you cannot run retrieval over a relevant, up-to-date corpus at inference time.

When strict latency or cost constraints forbid extra retrieval and extra model calls.

Failure Modes

Retrieved documents introduce misleading facts and cause correct answers to flip to incorrect ones (observed in Figure 5).

Lexical overlap introduced by wrong generated answers can surface documents that reinforce the wrong answer.
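The lexical-overlap failure mode can be reproduced with a toy example. The scorer below counts shared terms and stands in for BM25 only for illustration; the documents, question, and wrong draft answer are invented for this sketch, and `top_doc` is a hypothetical helper.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_doc(query: str, docs: List[str]) -> str:
    """Return the document sharing the most distinct terms with the query."""
    q = tokens(query)
    return max(docs, key=lambda d: len(q & tokens(d)))


docs = [
    "Canberra is the capital city of Australia.",
    "Sydney is the largest city of Australia and home to the Sydney Opera House.",
]
question = "What is the capital of Australia?"
wrong_draft = "Sydney, home of the Opera House"  # invented wrong answer

# The question alone surfaces the Canberra document.
print(top_doc(question, docs))
# Appending the wrong draft pulls the Sydney document to the top instead.
print(top_doc(f"{question} {wrong_draft}", docs))
```

A pilot could log how often the top-ranked document changes once the draft answer is appended to the query, as a rough signal of how strongly the draft is steering retrieval.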

Core Entities

Models

text-davinci-003, code-davinci-002 (Codex), InstructGPT (baseline references)

Metrics

Exact Match (EM), F1, Recall@K, ROUGE-L
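For local evaluation, Exact Match and token-level F1 can be computed as below. This sketch assumes the common SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the summary does not state which normalization the paper used, so treat that as an assumption. The function names are illustrative.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in golds))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Benchmark-level EM is then the mean of `exact_match` over all questions, multiplied by 100 to match the scale of the scores reported in the Results table.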

Datasets

NaturalQuestions (NQ), TriviaQA, HotpotQA, Wizard of Wikipedia (WoW), Wikipedia (corpus)

Benchmarks

KILT (splits for HotpotQA and WoW used)

Context Entities

Models

DPR, ORQA (related retrieval methods referenced)