Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.75
Citation Count
10
Why It Matters For Business
You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.
Summary TLDR
REFEED is a plug-and-play pipeline that refines a language model's generated answer: it uses that answer (or several sampled answers) to retrieve supporting documents, then re-prompts the model with those documents to produce a revised answer. On four knowledge-heavy benchmarks (NQ, TriviaQA, HotpotQA, WoW), REFEED improves accuracy over closed-book and basic retrieve-then-read baselines. Two practical modules, diverse answer generation and an ensemble that chooses between the before and after answers by likelihood, reduce cases where retrieval misleads the model.
Problem Statement
LLMs still hallucinate or give outdated or incomplete facts. Human feedback and fine-tuning help, but they are costly and cannot be applied at inference time. What is needed is an inexpensive, inference-time way to automatically check and improve individual generated outputs using external documents, without fine-tuning.
Main Contribution
Propose REFEED, a plug-and-play retrieval-feedback loop that conditions retrieval on the model's own generated answer to produce targeted supporting documents.
Introduce two practical modules: (1) diverse answer generation (sample multiple answers to widen retrieval coverage) and (2) an ensemble that picks the better answer by comparing log-likelihood before vs after retrieval.
Show consistent accuracy gains on four knowledge-intensive benchmarks (zero- and few-shot), and show REFEED works with chain-of-thought prompting.
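The ensemble module described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `ScoredAnswer` type and `pick_answer` function are hypothetical names, and scoring by mean per-token log-probability (length normalization) is an assumption made here so that longer answers are not penalized.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredAnswer:
    """An answer plus the per-token log-probabilities the model assigned to it."""
    text: str
    token_logprobs: List[float]

    def mean_logprob(self) -> float:
        # Length-normalized score (an assumption, see lead-in).
        if not self.token_logprobs:
            return float("-inf")
        return sum(self.token_logprobs) / len(self.token_logprobs)


def pick_answer(before: ScoredAnswer, after: ScoredAnswer) -> ScoredAnswer:
    """Keep the refined answer only if the model scores it at least as high."""
    return after if after.mean_logprob() >= before.mean_logprob() else before


# Illustrative values only: the refined answer is less likely, so the original is kept.
before = ScoredAnswer("Paris", [-0.2, -0.1])
after = ScoredAnswer("Lyon", [-1.5, -0.9])
chosen = pick_answer(before, after)
print(chosen.text)
```

The comparison needs token log-probabilities from the model, so it only applies where the API exposes them.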
Key Findings
REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.
REFEED yields measurable per-dataset gains over retrieve-then-read (for example, NQ EM rises from 31.7 to 39.6).
Both proposed modules help: on average, removing diverse generation lowers EM by ~1.1 points and removing the ensemble lowers EM by ~0.8 points.
Retrieval feedback can sometimes mislead the model and produce worse answers.
Results
EM
Who Should Care
What To Try In 7 Days
Implement a simple pipeline: generate answer → retrieve top-10 docs with BM25 using [question,answer] → re-prompt model with docs and compare before/after likelihoods.
Sample multiple answers (nucleus sampling) to diversify retrieval, then de-duplicate and keep top-k docs.
Add a likelihood-based ensemble: if the answer's likelihood drops after feedback, keep the original answer.
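The retrieval half of these steps can be prototyped without any external dependency. The sketch below is a rough starting point under stated assumptions: `TinyBM25` is a toy in-memory Okapi BM25 written for illustration (a real setup would use an indexed Wikipedia corpus), `retrieve_with_feedback` and `build_refine_prompt` are hypothetical helper names, the prompt wording is invented here, and the sampled answers are hard-coded stand-ins for nucleus-sampled model outputs.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


class TinyBM25:
    """Toy in-memory Okapi BM25 for illustration; not a production retriever."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def score(self, query: str, idx: int) -> float:
        tf = Counter(self.tokens[idx])
        doc_len = len(self.tokens[idx])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            n = self.df[term]
            idf = math.log(1 + (len(self.docs) - n + 0.5) / (n + 0.5))
            norm = tf[term] + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += idf * tf[term] * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[int]:
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def retrieve_with_feedback(question: str, answers: List[str], index: TinyBM25, k: int = 10) -> List[str]:
    """Query with [question, answer] per sampled answer, de-duplicate, keep top-k."""
    best: Dict[int, float] = {}
    for answer in answers:
        query = f"{question} {answer}"
        for i in index.top_k(query, k):
            # Simplification: keep each document's best score across queries.
            best[i] = max(best.get(i, 0.0), index.score(query, i))
    ranked = sorted(best, key=lambda i: best[i], reverse=True)[:k]
    return [index.docs[i] for i in ranked]


def build_refine_prompt(question: str, docs: List[str]) -> str:
    # Prompt wording is an assumption, not taken from the paper.
    context = "\n".join(f"[{n + 1}] {d}" for n, d in enumerate(docs))
    return f"Refer to the passages below and answer the question.\n{context}\nQuestion: {question}\nAnswer:"


corpus = [
    "Canberra is the capital city of Australia.",
    "Sydney is the largest city in Australia.",
    "Wellington is the capital of New Zealand.",
]
index = TinyBM25(corpus)
question = "What is the capital of Australia?"
sampled_answers = ["Sydney", "Canberra"]  # stand-ins for sampled model answers
docs = retrieve_with_feedback(question, sampled_answers, index, k=2)
prompt = build_refine_prompt(question, docs)
print(prompt)
```

Even in this toy corpus, the wrong sampled answer ("Sydney") pulls in a passage about Sydney. That is the lexical-overlap risk in miniature, and it is why the before/after likelihood check is worth adding as a second step.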
Reproducibility
Data Urls
- NaturalQuestions, TriviaQA, HotpotQA, Wizard of Wikipedia (public datasets); Wikipedia (public corpus)
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Retrieval feedback can mislead the model if retrieved docs are irrelevant or contain confusing signals (case studies show failures).
- Experiments use a fixed set of benchmarks and specific closed OpenAI models (text-davinci-003, Codex); gains may vary with other models or corpora.
- No public code release in the paper, making exact reproduction harder.
When Not To Use
- When you cannot run retrieval over a relevant, up-to-date corpus at inference time.
- When strict latency or cost constraints forbid extra retrieval and extra model calls.
- When model likelihoods are unavailable for ensemble decisions (black-box APIs without log-probabilities).
Failure Modes
- Retrieved documents introduce misleading facts and cause correct answers to flip to incorrect ones (observed in Figure 5).
- Lexical overlap introduced by wrong generated answers can surface documents that reinforce the wrong answer.
- If retriever quality is poor, refinement can add noise rather than correction.
Core Entities
Models
- text-davinci-003
- code-davinci-002 (Codex)
- InstructGPT (baseline references)
Metrics
- Exact Match (EM)
- F1
- Recall@K
- Rouge-L
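For readers who want to score their own runs, the first three metrics can be approximated as follows. This is a sketch under assumptions: it uses the SQuAD-style answer normalization that is common in open-domain QA (lowercase, strip punctuation and English articles), and it treats Recall@K as "any top-k document contains a gold answer string". The paper's exact evaluation scripts may differ, and the function names are illustrative.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked_docs: List[str], gold_answers: List[str], k: int) -> bool:
    """True if any of the top-k documents contains a gold answer string."""
    top = [normalize(d) for d in ranked_docs[:k]]
    return any(normalize(g) in d for g in gold_answers for d in top)


ranked = ["Paris is in France.", "The Eiffel Tower is in Paris."]
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))
print(recall_at_k(ranked, ["Eiffel Tower"], k=1), recall_at_k(ranked, ["Eiffel Tower"], k=2))
```

Dataset-level EM and F1 are then the means of these per-example values, usually reported as percentages.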
Datasets
- NaturalQuestions (NQ)
- TriviaQA
- HotpotQA
- Wizard of Wikipedia (WoW)
- Wikipedia (corpus)
Benchmarks
- KILT (splits for HotpotQA and WoW used)
Context Entities
Models
- DPR, ORQA (related retrieval methods referenced)

