REFEED: refine LLM outputs by retrieving documents about the model's own answers

Overview

Decision Snapshot: Ready For Pilot

REFEED is practical for production pilots: it needs only a retriever and prompts, and it raises accuracy on several benchmarks. It does require safeguards, because retrieval can sometimes hurt answers.

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 50%

Authors

Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, Ashish Sabharwal

Links

Abstract / PDF / Data

Why It Matters For Business

You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

REFEED is a plug-and-play pipeline that refines a language model's generated answer by using that answer (or many sampled answers) to retrieve supporting documents, then re-prompting the model with those documents to produce a revised answer. On four knowledge-heavy benchmarks (NQ, TriviaQA, HotpotQA, WoW) REFEED improves accuracy over closed-book and basic retrieve-then-read baselines. Two practical modules—diverse answer generation and an ensemble that picks before/after answers by likelihood—reduce cases where retrieval misleads the model.

Problem Statement

Large language models still hallucinate or give outdated or incomplete facts. Human feedback and fine-tuning help but are costly and cannot be applied at inference time. We need an inexpensive, inference-time way to automatically check and improve individual generated outputs using external documents without fine-tuning.

Main Contribution

Propose REFEED, a plug-and-play retrieval-feedback loop that conditions retrieval on the model's own generated answer to produce targeted supporting documents.

Introduce two practical modules: (1) diverse answer generation (sample multiple answers to widen retrieval coverage) and (2) an ensemble that picks the better answer by comparing log-likelihood before vs after retrieval.
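The ensemble idea can be illustrated with a minimal sketch. The summary only says the answers are compared by log-likelihood before and after retrieval; the `ScoredAnswer` class, the `choose_answer` function, and the use of a length-normalised (mean) token log-probability are assumptions made here for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredAnswer:
    """An answer plus the per-token log-probabilities the model assigned to it."""
    text: str
    token_logprobs: Sequence[float]

    @property
    def mean_logprob(self) -> float:
        # Assumption: length-normalised score. A summed score is another option.
        return sum(self.token_logprobs) / max(len(self.token_logprobs), 1)


def choose_answer(before: ScoredAnswer, after: ScoredAnswer) -> ScoredAnswer:
    """Return the refined answer only if it is at least as likely as the original."""
    return after if after.mean_logprob >= before.mean_logprob else before


# Toy values: the refined answer is less likely, so the original is kept.
before = ScoredAnswer("Paris", [-0.2, -0.1])
after = ScoredAnswer("Lyon", [-1.5, -0.9])
print(choose_answer(before, after).text)
```

In practice the log-probabilities would come from whatever scoring interface the chosen model exposes; the values above are made up.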

Key Findings

REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.

Numbers: ~+6% overall (reported) zero-shot improvement

Practical Use: Add retrieval feedback at inference to raise factual accuracy without fine-tuning the model.

Evidence Ref: Abstract, Sec 4.3.1, Table 1

REFEED yields measurable dataset gains versus retrieve-then-read (example: NQ EM 31.7 → 39.6).

Numbers: NQ EM +7.9 (31.7 → 39.6)

Practical Use: On questions like NQ, conditioning retrieval on model answers finds better supporting docs and can substantially raise exact-match scores.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
EM	39.6	Retrieve-then-Read EM 31.7	+7.9	NQ zero-shot (TD-003 backbone)	Table 1 (zero-shot)	Table 1
EM	68.9	Retrieve-then-Read EM 61.4	+7.5	TriviaQA zero-shot (TD-003 backbone)	Table 1 (zero-shot)	Table 1
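As a quick check on the table, the deltas can be recomputed from the reported EM values. The relative-gain figure is derived arithmetic added here, not a number quoted from the paper, and the `em_gain` helper name is hypothetical.

```python
# EM scores copied from the results table above (zero-shot, TD-003 backbone).
results = {
    "NQ": {"refeed_em": 39.6, "baseline_em": 31.7},
    "TriviaQA": {"refeed_em": 68.9, "baseline_em": 61.4},
}


def em_gain(refeed_em: float, baseline_em: float) -> tuple[float, float]:
    """Return (absolute gain in EM points, relative gain in percent)."""
    diff = refeed_em - baseline_em
    return round(diff, 1), round(100 * diff / baseline_em, 1)


for name, row in results.items():
    absolute, relative = em_gain(row["refeed_em"], row["baseline_em"])
    print(f"{name}: +{absolute} EM points ({relative}% relative to baseline)")
```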

What To Try In 7 Days

Implement a simple pipeline: generate answer → retrieve top-10 docs with BM25 using [question,answer] → re-prompt model with docs and compare before/after likelihoods.

Sample multiple answers (nucleus sampling) to diversify retrieval, then de-duplicate and keep top-k docs.

Add a likelihood-based ensemble: if the answer's likelihood drops after feedback, keep the original answer.
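The retrieval steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions: `TinyBM25` is a hand-written in-memory BM25 scorer standing in for a real retriever, `retrieve_with_feedback` is a hypothetical helper name, the three-document corpus is made up, and the sampled answers are hard-coded rather than produced by a model.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Minimal in-memory Okapi BM25 scorer, for illustration only."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        # Fall back to 1.0 so an empty corpus never causes a division by zero.
        self.avg_len = sum(self.length.values()) / max(len(docs), 1) or 1.0
        df = Counter(term for tf in self.tf.values() for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def search(self, query: str, top_k: int = 10) -> List[Tuple[str, float]]:
        terms = set(tokenize(query))
        scores: Dict[str, float] = {}
        for doc_id, tf in self.tf.items():
            norm = self.k1 * (1 - self.b + self.b * self.length[doc_id] / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                scores[doc_id] = score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


def retrieve_with_feedback(
    index: TinyBM25, question: str, answers: List[str], top_k: int = 10
) -> List[str]:
    """Query with [question, answer] for each sampled answer, then de-duplicate."""
    best: Dict[str, float] = {}
    for answer in answers:
        for doc_id, score in index.search(f"{question} {answer}", top_k):
            # Keep each document once, with its best score across all queries.
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Made-up corpus and sampled answers, purely for demonstration.
docs = {
    "d1": "The Eiffel Tower is in Paris, the capital of France.",
    "d2": "Lyon is a city in France known for its cuisine.",
    "d3": "Mount Everest is the highest mountain on Earth.",
}
index = TinyBM25(docs)
top_docs = retrieve_with_feedback(
    index, "What is the capital of France?", ["Paris", "Lyon"]
)
print(top_docs)
```

The retrieved documents would then be placed in the re-prompt. Note that a wrong sampled answer (here "Lyon") also pulls in documents that mention it, which is the lexical-overlap risk listed under Failure Modes.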

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Data URLs

NaturalQuestions, TriviaQA, HotpotQA, Wizard of Wikipedia (public datasets); Wikipedia (public corpus)

Risks & Boundaries

Limitations

Retrieval feedback can mislead the model if retrieved docs are irrelevant or contain confusing signals (case studies show failures).

Experiments use a fixed set of benchmark datasets and specific OpenAI models (TD-003, Codex); gains may vary with other models or corpora.

When Not To Use

When you cannot run retrieval over a relevant, up-to-date corpus at inference time.

When strict latency or cost constraints forbid extra retrieval and extra model calls.

Failure Modes

Retrieved documents introduce misleading facts and cause correct answers to flip to incorrect ones (observed in Figure 5).

Lexical overlap introduced by wrong generated answers can surface documents that reinforce the wrong answer.

Core Entities

Models

text-davinci-003, code-davinci-002 (Codex), InstructGPT (baseline references)

Metrics

Exact Match (EM), F1, Recall@K, ROUGE-L

Datasets

NaturalQuestions (NQ), TriviaQA, HotpotQA, Wizard of Wikipedia (WoW), Wikipedia (corpus)
