Augment ChatGPT with retrieved evidence and automated feedback to cut hallucinations

February 24, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.45

Citation Count

144

Authors

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, Jianfeng Gao

Links

Abstract / PDF

Why It Matters For Business

You can keep using a black-box LLM while reducing harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback. This improves factuality with modest engineering rather than costly fine-tuning.

Summary TLDR

LLM-AUGMENTER is a plug-and-play system that wraps a frozen LLM (ChatGPT in the experiments) with modules that (1) retrieve and consolidate external evidence, (2) prompt the LLM with that evidence, and (3) generate automated feedback to iteratively revise responses. On dialog and open-domain QA benchmarks, the system reduces hallucinations, as measured by Knowledge F1 (KF1) and token-level F1, while keeping responses fluent. The design supports rule-based or learned policies, BM25, DPR, or CORE retrievers, and self-criticism or rule-based feedback.

Problem Statement

Large frozen LLMs often hallucinate and cannot access fresh or private knowledge. Fine-tuning is costly or impossible for black-box LLMs. The paper asks: can we wrap a fixed LLM with retrieval, evidence consolidation, and automated feedback to reduce hallucinations and improve factual grounding without changing model weights?

Main Contribution

Design of LLM-AUGMENTER: modular pipeline (Working Memory, Policy, Action Executor, Utility) to add retrieval, consolidation, and feedback around a frozen LLM.

Show empirically that retrieved and consolidated evidence plus automated feedback reduces hallucination and improves factual scores on two scenarios: information-seeking dialog and multi-hop Wiki QA.

Demonstrate both rule-based and learnable policies; release code and models to reproduce the pipeline.

Key Findings

Adding retrieved, consolidated evidence raises knowledge grounding (KF1) by about 10 points (9.7) on the News Chat dialog task.

Numbers: KF1 26.71 -> 36.41 (ChatGPT -> LLM-AUGMENTER, News Chat, Table 1)

Automated feedback further improves grounding (KF1) by 3.3 to 7.2 points on the dialog tasks.

Numbers: News Chat +3.3 KF1; Customer Service +7.2 KF1 (oracle/gold settings cited)

Human raters prefer the augmented system on the customer service task: usefulness rises by 11.0 points and humanness by 4.3 points.

Numbers: Usefulness 34.07 -> 45.07; Humanness 30.92 -> 35.22 (Table 3)

On multi-hop Wiki QA, consolidated evidence plus feedback raises short-answer F1 from 0.59 to 11.80 over closed-book ChatGPT.

Numbers: F1 0.59 -> 11.80 (ChatGPT -> LLM-AUGMENTER with CORE+feedback, Table 5)

Results

News Chat KF1

Value: ChatGPT 26.71 -> LLM-AUGMENTER BM25+feedback 36.41

Baseline: ChatGPT (closed-book)

Customer Service KF1

Value: ChatGPT 31.33 -> LLM-AUGMENTER BM25+feedback 37.41

Baseline: ChatGPT (closed-book)

Human eval - Usefulness

Value: ChatGPT 34.07 -> LLM-AUGMENTER 45.07

Baseline: ChatGPT

Human eval - Humanness

Value: ChatGPT 30.92 -> LLM-AUGMENTER 35.22

Baseline: ChatGPT

Wiki QA token-level F1

Value: ChatGPT 0.59 -> LLM-AUGMENTER CORE+feedback 11.80

Baseline: ChatGPT (closed-book)

Who Should Care

What To Try In 7 Days

Add a simple retriever (BM25) and include top-k passages in the prompt for a closed-book LLM.
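
A minimal sketch of this step, assuming a small in-memory corpus and a hand-written Okapi BM25 scorer (the `BM25` class, `build_prompt` helper, prompt wording, and sample passages are all illustrative, not taken from the paper's code):

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 ranker over an in-memory list of passages."""

    def __init__(self, passages, k1=1.5, b=0.75):
        self.passages = passages
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(p)) for p in passages]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of passages containing each term.
        self.df = Counter(term for d in self.docs for term in d)

    def score(self, query, idx):
        doc, length, n = self.docs[idx], self.lengths[idx], len(self.docs)
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1.0)
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += idf * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.passages[i] for i in ranked[:k]]


def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question (hypothetical wording)."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using only the evidence below. "
            "If the evidence is insufficient, say so.\n\n"
            f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:")


corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "BM25 is a ranking function used by search engines.",
    "Paris is the capital of France.",
]
question = "When was the Eiffel Tower completed?"
retriever = BM25(corpus)
prompt = build_prompt(question, retriever.top_k(question, k=2))
print(prompt)
```

In practice an off-the-shelf BM25 implementation or search index would likely replace the hand-written scorer; the point here is only the shape of the retrieve-then-prompt step.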

Implement a rule-based utility that checks overlap with retrieved evidence (KF1) and rejects answers below threshold.
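
One way to sketch such a utility is a token-overlap F1 between the response and the evidence, compared against a cutoff. The tokenizer, the function names, and the threshold value of 0.3 are assumptions for illustration; the paper's exact KF1 computation may differ in tokenization and stop-word handling.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def knowledge_f1(response, evidence):
    """Token-overlap F1 between a response and the evidence text."""
    resp, evid = Counter(tokenize(response)), Counter(tokenize(evidence))
    overlap = sum((resp & evid).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(evid.values())
    return 2 * precision * recall / (precision + recall)


def accept(response, evidence, threshold=0.3):
    """Rule-based utility: pass only responses that overlap enough with evidence.

    The threshold is an arbitrary placeholder and would need tuning on real data.
    """
    return knowledge_f1(response, evidence) >= threshold


evidence = "The Eiffel Tower was completed in 1889."
grounded = "The tower was completed in 1889."
ungrounded = "It opened in 1925 in London."

print(accept(grounded, evidence), accept(ungrounded, evidence))
```

A pure overlap score rewards copying the evidence, so it is best treated as a cheap first filter, not a correctness check.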

Create a simple template feedback message to re-prompt the LLM when grounding is missing.
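
A minimal sketch of a template-based feedback message, assuming feedback is built by listing answer terms that do not appear in the evidence. The template wording, `unsupported_terms`, and `build_revision_prompt` are hypothetical names and text, not the paper's prompts.

```python
import re

# Hypothetical feedback wording; adjust to the target LLM and domain.
FEEDBACK_TEMPLATE = (
    "Your previous answer was not sufficiently supported by the evidence.\n"
    "Previous answer: {answer}\n"
    "Unsupported terms: {unsupported}\n"
    "Rewrite the answer using only facts stated in the evidence."
)


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unsupported_terms(answer, evidence, min_len=4):
    """Answer tokens (length >= min_len) that never occur in the evidence."""
    evidence_tokens = set(tokenize(evidence))
    seen, terms = set(), []
    for tok in tokenize(answer):
        if len(tok) >= min_len and tok not in evidence_tokens and tok not in seen:
            seen.add(tok)
            terms.append(tok)
    return terms


def build_revision_prompt(original_prompt, answer, evidence):
    """Append the filled-in feedback template to the original prompt."""
    terms = unsupported_terms(answer, evidence)
    feedback = FEEDBACK_TEMPLATE.format(
        answer=answer, unsupported=", ".join(terms) or "(none found)")
    return f"{original_prompt}\n\n{feedback}\nRevised answer:"


evidence = "The Eiffel Tower was completed in 1889."
first_prompt = f"Evidence: {evidence}\nQuestion: When was it completed?\nAnswer:"
print(build_revision_prompt(first_prompt, "It opened in 1925 in London.", evidence))
```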

Agent Features

Memory

  • Working Memory stores dialog history, evidence, candidates, utilities
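
As a rough illustration, the Working Memory could be modeled as a small record type holding the four items listed above. The class and field names below are assumptions, and pairing each candidate with its utility score is one design choice among several.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """One LLM response plus the utility score assigned to it."""
    text: str
    utility: Optional[float] = None  # None until a utility function scores it


@dataclass
class WorkingMemory:
    """Per-conversation state shared by the other modules (illustrative layout)."""
    dialog_history: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    candidates: List[Candidate] = field(default_factory=list)

    def add_turn(self, speaker, text):
        self.dialog_history.append(f"{speaker}: {text}")

    def add_candidate(self, text, utility=None):
        self.candidates.append(Candidate(text, utility))

    def best_candidate(self):
        """Highest-utility scored candidate, or None if nothing is scored yet."""
        scored = [c for c in self.candidates if c.utility is not None]
        return max(scored, key=lambda c: c.utility) if scored else None


memory = WorkingMemory()
memory.add_turn("user", "When was the Eiffel Tower completed?")
memory.evidence.append("The Eiffel Tower was completed in 1889.")
memory.add_candidate("It opened in 1925.", utility=0.25)
memory.add_candidate("It was completed in 1889.", utility=0.8)
print(memory.best_candidate().text)
```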

Planning

  • policy selects next action (retrieve, prompt, send)
  • policy can be rule-based or RL-trained
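
A rule-based policy over these three actions might look like the sketch below. The specific rules, the utility threshold, and the attempt limit are illustrative assumptions, not the hand-written rules used in the paper.

```python
from enum import Enum


class Action(Enum):
    RETRIEVE = "retrieve"  # fetch evidence from external knowledge
    PROMPT = "prompt"      # query the LLM for a (revised) candidate
    SEND = "send"          # return the current candidate to the user


def select_action(has_evidence, last_utility, attempts,
                  threshold=0.5, max_attempts=3):
    """Pick the next action from a simple summary of the working memory.

    has_evidence: whether any evidence has been retrieved yet
    last_utility: score of the latest candidate, or None if there is none
    attempts:     number of LLM prompts issued so far for this turn
    """
    if not has_evidence:
        return Action.RETRIEVE
    if last_utility is None:
        return Action.PROMPT
    if last_utility < threshold and attempts < max_attempts:
        return Action.PROMPT  # candidate is weak: re-prompt with feedback
    return Action.SEND        # good enough, or out of retries


print(select_action(False, None, 0))
print(select_action(True, None, 0))
print(select_action(True, 0.2, 1))
print(select_action(True, 0.8, 2))
```

Capping the number of attempts bounds latency and API cost, at the price of sometimes sending a low-utility answer; a stricter variant could abstain instead.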

Tool Use

  • retriever APIs (BM25, DPR)
  • external web/task DB APIs
  • LLM prompting (ChatGPT)

Frameworks

  • MDP formulation; policy optimized with REINFORCE
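
To make the REINFORCE update concrete, here is a toy sketch with a two-action softmax policy and a one-step episode. The action names, the reward function, and the hyperparameters are invented for illustration; the paper trains a neural policy, not a table of two logits.

```python
import math
import random

ACTIONS = ["retrieve", "respond_directly"]  # hypothetical action set


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(action):
    # Assumption for the demo: answers given after retrieval always score higher.
    return 1.0 if action == "retrieve" else 0.0


def train(steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        reward = toy_reward(ACTIONS[a])
        # REINFORCE: logits += lr * reward * grad log pi(a),
        # where grad log pi(a) w.r.t. the logits is one_hot(a) - probs.
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_pi
    return softmax(logits)


final_probs = train()
print(dict(zip(ACTIONS, final_probs)))
```

With this reward, probability mass shifts toward `retrieve` as training proceeds. A real setup would typically subtract a baseline from the reward to reduce variance and would compute rewards from a utility function over full dialog episodes.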

Is Agentic

true

Architectures

  • modular pipeline (Working Memory, Policy, Action Executor, Utility)

Optimization Features

System Optimization

  • iterative prompting with verification to avoid sending hallucinated answers
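
The verify-then-revise loop can be sketched as follows. The `support` score (fraction of answer tokens found in the evidence), the threshold, the feedback wording, and `stub_llm` are all assumptions; `stub_llm` stands in for a real LLM call so the example runs offline.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def support(answer, evidence):
    """Fraction of answer tokens that also appear in the evidence."""
    tokens = tokenize(answer)
    evidence_tokens = set(tokenize(evidence))
    return sum(t in evidence_tokens for t in tokens) / len(tokens) if tokens else 0.0


def answer_with_verification(question, evidence, llm, threshold=0.7, max_rounds=3):
    """Prompt, score the candidate, and re-prompt with feedback until it passes."""
    prompt = f"Evidence: {evidence}\nQuestion: {question}\nAnswer:"
    best_answer, best_score, rounds = None, -1.0, 0
    for _ in range(max_rounds):
        rounds += 1
        answer = llm(prompt)
        score = support(answer, evidence)
        if score > best_score:
            best_answer, best_score = answer, score
        if score >= threshold:
            break  # verified: safe to send
        prompt += (f" {answer}\nFeedback: the answer is not supported by the "
                   "evidence. Revise it.\nAnswer:")
    return best_answer, best_score, rounds


def stub_llm(prompt):
    # Hypothetical stand-in: hallucinates first, corrects itself after feedback.
    if "Feedback:" in prompt:
        return "It was completed in 1889."
    return "It opened in 1925."


evidence = "The Eiffel Tower was completed in 1889."
result = answer_with_verification("When was it completed?", evidence, stub_llm)
print(result)
```

If no candidate clears the threshold within `max_rounds`, this sketch returns the best-scoring one; a stricter deployment could abstain instead.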

Training Optimization

  • policy bootstrapped from rules, trained with simulated users, fine-tuned with RL

Inference Optimization

  • always-use vs self-ask policy tradeoff to reduce retrieval cost
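
The tradeoff can be illustrated by counting retrieval calls under each policy. This sketch assumes "self-ask" means the system first asks whether external knowledge is needed before retrieving; `stub_needs_knowledge` is a keyword heuristic standing in for that LLM query, and the sample dialog is invented.

```python
import re

# Hypothetical chit-chat vocabulary used only by the stub below.
CHITCHAT = {"hi", "hello", "thanks", "thank", "you", "bye", "ok"}


def stub_needs_knowledge(utterance):
    """Stand-in for asking the LLM whether it needs external knowledge."""
    tokens = re.findall(r"[a-z]+", utterance.lower())
    return any(t not in CHITCHAT for t in tokens)


def always_use(utterance):
    """Retrieve on every turn."""
    return True


def self_ask(utterance, needs_knowledge=stub_needs_knowledge):
    """Retrieve only when the (stubbed) self-check says knowledge is needed."""
    return needs_knowledge(utterance)


def retrieval_calls(utterances, policy):
    return sum(1 for u in utterances if policy(u))


dialog = ["Hello", "When was the Eiffel Tower completed?", "Thanks, bye"]
calls_always = retrieval_calls(dialog, always_use)
calls_self_ask = retrieval_calls(dialog, self_ask)
print(calls_always, calls_self_ask)
```

With a real LLM, the self-check is itself an extra query, so self-ask only pays off when retrieval is more expensive than that check or when many turns need no evidence.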

Reproducibility

Data Urls

  • DSTC7 News Chat (dataset name)
  • DSTC11 Customer Service (Track 5, dataset name)
  • OTT-QA / Wiki QA (dataset name)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Extra LLM queries (often two prompts) increase latency and API cost.
  • Performance depends on retrieval coverage; if evidence is missing the system can still hallucinate.
  • Main ChatGPT experiments used a manual rule-based policy; RL results are limited to cheaper models (T5-Base).

When Not To Use

  • Applications that require single-round ultra-low latency or minimal API cost.
  • Scenarios where reliable external knowledge sources are unavailable or untrusted.
  • Cases that cannot tolerate additional system complexity or retrieval maintenance.

Failure Modes

  • When no supporting evidence is retrieved, the system may keep hallucinating or abstain.
  • Utility functions can be biased or misaligned and may accept plausible but incorrect answers.
  • External knowledge sources can introduce wrong or adversarial facts into prompts.

Core Entities

Models

  • ChatGPT
  • T5-Base
  • DPR
  • CORE

Metrics

  • Knowledge F1
  • BLEU
  • ROUGE
  • METEOR
  • BLEURT
  • BERTScore
  • BARTScore
  • Token-level F1/Precision/Recall

Datasets

  • DSTC7 News Chat
  • DSTC11 Customer Service (Track 5)
  • OTT-QA (Wiki QA)

Benchmarks

  • News Chat (DSTC7)
  • Customer Service (DSTC11)
  • Wiki QA (OTT-QA)