Overview
REFEED is practical for production pilots: it needs only a retriever and prompts and raises accuracy on several benchmarks, but it requires safeguards because retrieval can sometimes hurt answers.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: No
At A Glance
Cost impact: 75%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.
Who Should Care
Summary TLDR
REFEED is a plug-and-play pipeline that refines a language model's generated answer by using that answer (or many sampled answers) to retrieve supporting documents, then re-prompting the model with those documents to produce a revised answer. On four knowledge-heavy benchmarks (NQ, TriviaQA, HotpotQA, WoW), REFEED improves accuracy over closed-book and basic retrieve-then-read baselines. Two practical modules reduce cases where retrieval misleads the model: diverse answer generation, and an ensemble that chooses between the before-retrieval and after-retrieval answers by likelihood.
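The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `refeed`, `generate`, `retrieve`, the prompt wording, and the stub functions are all hypothetical names chosen for this example, and the model and retriever are passed in as plain callables so the sketch runs without any external service.

```python
from typing import Callable, Dict, List


def refeed(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 10,
) -> Dict[str, object]:
    """Generate an answer, retrieve with it, then re-prompt with the documents."""
    # Step 1: closed-book answer from the question alone.
    initial = generate(f"Question: {question}\nAnswer:")
    # Step 2: build the retrieval query from the question AND the answer.
    docs = retrieve(f"{question} {initial}", top_k)
    # Step 3: re-prompt with the retrieved documents as context.
    context = "\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    refined = generate(f"{context}\nQuestion: {question}\nAnswer:")
    return {"initial": initial, "docs": docs, "refined": refined}


# Stub model and retriever (assumptions for illustration only).
def fake_generate(prompt: str) -> str:
    # Pretend the model corrects itself once it sees a document.
    return "Paris" if "Document 1" in prompt else "Lyon"


def fake_retrieve(query: str, k: int) -> List[str]:
    return ["Paris is the capital of France."][:k]


result = refeed("What is the capital of France?", fake_generate, fake_retrieve)
```

In a real pilot, `generate` would wrap an LLM call and `retrieve` a search index; keeping both as injected callables makes it easy to swap either one without touching the loop.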
Problem Statement
Large language models (LLMs) still hallucinate or give outdated or incomplete facts. Human feedback and fine-tuning help but are costly and cannot be applied at inference time. We need an inexpensive, inference-time way to automatically check and improve individual generated outputs using external documents, without fine-tuning.
Main Contribution
Propose REFEED, a plug-and-play retrieval-feedback loop that conditions retrieval on the model's own generated answer to produce targeted supporting documents.
Introduce two practical modules: (1) diverse answer generation (sample multiple answers to widen retrieval coverage) and (2) an ensemble that picks the better answer by comparing log-likelihood before vs after retrieval.
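The ensemble module in the second item can be illustrated as a comparison of answer likelihoods. The sketch below assumes the model API returns per-token log-probabilities for each answer and that scores are length-normalized by averaging; both are assumptions made here, since this summary does not say how the likelihood is computed. `mean_logprob` and `pick_answer` are hypothetical helper names.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability (length normalization is an assumption)."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def pick_answer(
    before: str,
    before_logprobs: Sequence[float],
    after: str,
    after_logprobs: Sequence[float],
) -> str:
    """Keep the post-retrieval answer only if its likelihood did not drop."""
    if mean_logprob(after_logprobs) >= mean_logprob(before_logprobs):
        return after
    return before


# The refined answer is less likely than the original, so the original is kept.
kept = pick_answer("Paris", [-0.1, -0.2], "Lyon", [-1.0, -2.0])
# The refined answer is more likely, so it replaces the original.
replaced = pick_answer("Lyon", [-1.0], "Paris", [-0.5])
```

Averaging rather than summing avoids penalizing longer answers; a summed score is an equally plausible reading and would only change `mean_logprob`.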
Key Findings
REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.
Against retrieve-then-read with the TD-003 backbone, REFEED raises zero-shot EM on NQ from 31.7 to 39.6 (+7.9) and on TriviaQA from 61.4 to 68.9 (+7.5).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| EM | 39.6 | Retrieve-then-Read EM 31.7 | +7.9 | NQ zero-shot (TD-003 backbone) | Table 1 (zero-shot) | Table 1 |
| EM | 68.9 | Retrieve-then-Read EM 61.4 | +7.5 | TriviaQA zero-shot (TD-003 backbone) | Table 1 (zero-shot) | Table 1 |
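For readers reproducing the EM (exact match) numbers in the table, the sketch below shows one common way to compute the metric. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); this summary does not state which normalization the paper uses, so treat the details as an assumption. All function names are hypothetical.

```python
import re
import string
from typing import List, Sequence

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> float:
    """1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))


def em_score(predictions: Sequence[str], references: Sequence[List[str]]) -> float:
    """Percentage of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


score = em_score(["The Eiffel Tower.", "Rome"], [["Eiffel Tower"], ["Madrid"]])
```

Running the same scorer on the before-retrieval and after-retrieval answers gives the baseline and REFEED columns, and their difference gives the delta.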
What To Try In 7 Days
Implement a simple pipeline: generate answer → retrieve top-10 docs with BM25 using [question,answer] → re-prompt model with docs and compare before/after likelihoods.
Sample multiple answers (nucleus sampling) to diversify retrieval, then de-duplicate and keep top-k docs.
Add a likelihood-based ensemble: if the answer's likelihood drops after feedback, keep the original answer.
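The retrieval steps above can be prototyped without external dependencies. The sketch below is a small in-memory BM25 scorer plus a merge step that queries once per sampled answer, de-duplicates by document, and keeps the top-k. It makes several assumptions not stated in the source: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and ranking merged documents by their best score across queries. The sampled answers are passed in as a list, so the sampling itself (for example nucleus sampling) is left to the model API. `BM25` and `retrieve_with_answers` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class BM25:
    """Tiny in-memory BM25 index over a non-empty list of documents."""

    def __init__(self, corpus: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.docs = [tokenize(d) for d in corpus]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, idx: int) -> float:
        tf, dl = self.tf[idx], len(self.docs[idx])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[int]:
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def retrieve_with_answers(index: BM25, question: str, answers: Sequence[str], k: int = 10) -> List[int]:
    """Query once per sampled answer, de-duplicate by doc id, keep the top-k."""
    best: Dict[int, float] = {}
    for answer in answers:
        query = f"{question} {answer}"  # the [question, answer] query
        for i in index.top_k(query, k):
            # A document hit by several queries keeps its best score.
            best[i] = max(best.get(i, float("-inf")), index.score(query, i))
    return sorted(best, key=lambda i: best[i], reverse=True)[:k]


corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "bananas are yellow",
]
index = BM25(corpus)
doc_ids = retrieve_with_answers(index, "capital of france", ["paris", "lyon"], k=2)
```

The returned ids index into `corpus`; the corresponding texts are what gets placed in the re-prompt before the before/after likelihood comparison.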
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Retrieval feedback can mislead the model if retrieved docs are irrelevant or contain confusing signals (case studies show failures).
Experiments use closed datasets and specific OpenAI models (TD-003, Codex); gains may vary with other models or corpora.
When Not To Use
When you cannot run retrieval over a relevant, up-to-date corpus at inference time.
When strict latency or cost constraints forbid extra retrieval and extra model calls.
Failure Modes
Retrieved documents introduce misleading facts and cause correct answers to flip to incorrect ones (observed in Figure 5).
Lexical overlap introduced by wrong generated answers can surface documents that reinforce the wrong answer.

