Overview
Production Readiness: 0.6
Novelty Score: 0.55
Cost Impact Score: 0.45
Citation Count: 144
Why It Matters For Business
You can keep using a black-box LLM and still reduce harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback. This improves factuality with modest engineering effort instead of costly fine-tuning.
Summary TLDR
LLM-AUGMENTER is a plug-and-play system that wraps a frozen LLM (ChatGPT in experiments) with modules that (1) retrieve and consolidate external evidence, (2) prompt the LLM with that evidence, and (3) generate automated feedback to iteratively revise responses. On dialog and open-domain QA tests the system meaningfully reduces hallucinations (measured by Knowledge-F1 and F1) while keeping responses fluent. The design supports rule-based or learned policies and can use BM25/DPR/CORE retrievers and self-criticism or rule-based feedback.
Problem Statement
Large frozen LLMs often hallucinate and cannot access fresh or private knowledge. Fine-tuning is costly or impossible for black-box LLMs. The paper asks: can we wrap a fixed LLM with retrieval, evidence consolidation, and automated feedback to reduce hallucinations and improve factual grounding without changing model weights?
Main Contribution
Design LLM-AUGMENTER, a modular pipeline (Working Memory, Policy, Action Executor, Utility) that adds retrieval, consolidation, and feedback around a frozen LLM.
Show empirically that retrieved and consolidated evidence plus automated feedback reduces hallucination and improves factual scores in two scenarios: information-seeking dialog and multi-hop Wiki QA.
Demonstrate both rule-based and learnable policies, and release code and models to reproduce the pipeline.
Key Findings
Retrieving and consolidating evidence raises knowledge grounding (KF1) by about 10 points on news dialog.
Automated feedback improves grounding (KF1) by several more points on dialog tasks.
Human raters prefer the augmented system: on customer service, usefulness rises by 11 points and humanness by 4.3 points.
On multi-hop Wiki QA, consolidated evidence plus feedback raises short-answer F1 substantially over closed-book ChatGPT.
Results
Metrics reported: News Chat KF1; Customer Service KF1; human evaluation (usefulness, humanness); Wiki QA token-level F1.
What To Try In 7 Days
Add a simple retriever (BM25) and include top-k passages in the prompt for a closed-book LLM.
Implement a rule-based utility that checks overlap with retrieved evidence (KF1) and rejects answers below threshold.
Create a simple template feedback message to re-prompt the LLM when grounding is missing.
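The three steps above can be sketched end to end in a few dozen lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `bm25_top_k` is a plain Okapi BM25 written from scratch, `knowledge_f1` scores token overlap against the retrieved evidence (used here as a stand-in for the knowledge a KF1 score is normally computed against), and `llm` is a hypothetical callable that takes a prompt string and returns a response string. The threshold, round limit, and prompt wording are arbitrary placeholders, and the passage list is assumed to be non-empty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query, passages, k=2, k1=1.5, b=0.75):
    """Rank passages with a plain Okapi BM25 score and return the top k."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many passages each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:k]]


def knowledge_f1(response, evidence):
    """Token-overlap F1 between a response and the evidence text."""
    r, e = Counter(tokenize(response)), Counter(tokenize(evidence))
    overlap = sum((r & e).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(r.values())
    recall = overlap / sum(e.values())
    return 2 * precision * recall / (precision + recall)


def answer_with_feedback(question, passages, llm, threshold=0.3, max_rounds=2):
    """Prompt with evidence, check grounding, and re-prompt with feedback."""
    evidence = "\n".join(bm25_top_k(question, passages))
    prompt = (
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer using only the evidence."
    )
    for _ in range(max_rounds):
        candidate = llm(prompt)
        if knowledge_f1(candidate, evidence) >= threshold:
            return candidate
        # Template feedback: tell the model its answer was not grounded.
        prompt += (
            f"\n\nPrevious answer: {candidate}\n"
            "Feedback: the answer is not supported by the evidence. Revise it."
        )
    return "I could not find a grounded answer."
```

Because recall is computed against all retrieved text, long evidence lowers the score of a short correct answer, so the threshold would need tuning per task. In practice a tested BM25 library and a real LLM client would replace the hand-written retriever and the `llm` callable.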
Agent Features
Memory
- Working Memory stores dialog history, retrieved evidence, candidate responses, and their utility scores
Planning
- policy selects next action (retrieve, prompt, send)
- policy can be rule-based or RL-trained
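One way to picture the memory and rule-based policy described above is the sketch below. The field names, the threshold, and the candidate limit are illustrative assumptions rather than the paper's actual data structures. The only parts taken from the summary are the four kinds of stored state and the three actions (retrieve, prompt, send).

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical container for the state the policy reads."""
    dialog_history: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    candidates: list = field(default_factory=list)
    utilities: list = field(default_factory=list)  # one score per candidate


def rule_based_policy(memory, threshold=0.5, max_candidates=3):
    """Pick the next action from hand-written rules over the memory state."""
    if not memory.evidence:
        return "retrieve"  # nothing to ground on yet
    if not memory.candidates:
        return "prompt"    # evidence is ready, no response generated yet
    good_enough = memory.utilities[-1] >= threshold
    out_of_budget = len(memory.candidates) >= max_candidates
    if good_enough or out_of_budget:
        return "send"
    return "prompt"        # revise the last candidate using feedback
```

A learned policy would keep the same interface (memory in, action out) and replace the hand-written rules with a trained model.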
Tool Use
- retriever APIs (BM25, DPR)
- external web/task DB APIs
- LLM prompting (ChatGPT)
Frameworks
- MDP formulation; policy optimized with REINFORCE
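To make the REINFORCE update concrete, here is a toy sketch on a one-step decision, which is a deliberate simplification of the multi-step MDP. The action names, reward function, learning rate, and step count are all assumptions made for illustration. A softmax policy over two actions is nudged toward whichever action earns reward, using the gradient of the log-probability of the sampled action.

```python
import math
import random

ACTIONS = ["retrieve", "answer_directly"]  # hypothetical action set


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def reinforce(reward_fn, steps=2000, lr=0.5, seed=0):
    """Train a softmax policy with the REINFORCE gradient estimator."""
    rng = random.Random(seed)
    logits = [0.0] * len(ACTIONS)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        reward = reward_fn(ACTIONS[action])
        # d/d logit_k of log pi(action) is 1[k == action] - pi_k.
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)


# Usage: reward retrieval only, then inspect the learned probabilities.
learned = reinforce(lambda a: 1.0 if a == "retrieve" else 0.0)
```

In a full system the reward would come from the utility module and the policy would be a neural network conditioned on working memory. The update rule itself stays the same.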
Is Agentic
true
Architectures
- modular pipeline (Working Memory, Policy, Action Executor, Utility)
Optimization Features
System Optimization
- iterative prompting with verification to avoid sending hallucinated answers
Training Optimization
- policy bootstrapped from rules, trained with simulated users, fine-tuned with RL
Inference Optimization
- always-use vs self-ask policy tradeoff to reduce retrieval cost
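The tradeoff can be illustrated by counting retrieval calls under each policy. This is a sketch under assumptions: `needs_knowledge` is a hypothetical callable standing in for asking the LLM whether a query needs external knowledge, and the keyword stub below exists only to make the example runnable.

```python
def always_use(question, needs_knowledge):
    """Retrieve for every query, regardless of content."""
    return True


def self_ask(question, needs_knowledge):
    """Retrieve only when the (stand-in) model says knowledge is needed."""
    return needs_knowledge(question)


def count_retrievals(questions, policy, needs_knowledge):
    """Number of retrieval calls a policy would trigger on these questions."""
    return sum(1 for q in questions if policy(q, needs_knowledge))


def keyword_stub(question):
    """Placeholder for an LLM judgment; a real system would prompt the model."""
    return any(w in question.lower() for w in ("who", "when", "latest"))
```

With self-ask, retrieval calls drop for chit-chat turns, at the price of an extra decision step and the risk of skipping retrieval when it was actually needed.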
Reproducibility
Data Urls
- DSTC7 News Chat (dataset name)
- DSTC11 Customer Service (Track 5, dataset name)
- OTT-QA / Wiki QA (dataset name)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Extra LLM queries (often two prompts) increase latency and API cost.
- Performance depends on retrieval coverage; if evidence is missing the system can still hallucinate.
- Main ChatGPT experiments used a manual rule-based policy; RL results are limited to cheaper models (T5-Base).
When Not To Use
- Applications that require single-round ultra-low latency or minimal API cost.
- Scenarios where reliable external knowledge sources are unavailable or untrusted.
- Cases that cannot tolerate additional system complexity or retrieval maintenance.
Failure Modes
- When no supporting evidence is retrieved, the system may keep hallucinating or abstain.
- Utility functions can be biased or misaligned and may accept plausible but incorrect answers.
- External knowledge sources can introduce wrong or adversarial facts into prompts.
Core Entities
Models
- ChatGPT
- T5-Base
- DPR
- CORE
Metrics
- Knowledge F1
- BLEU
- ROUGE
- METEOR
- BLEURT
- BERTScore
- BARTScore
- Token-level F1/Precision/Recall
Datasets
- DSTC7 News Chat
- DSTC11 Customer Service (Track 5)
- OTT-QA (Wiki QA)
Benchmarks
- News Chat (DSTC7)
- Customer Service (DSTC11)
- Wiki QA (OTT-QA)

