A retrieval+claim-verification pipeline that cuts hallucinations and can be distilled to a fast 7B model

May 23, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

Strong human-evaluated factuality results on Wikipedia-backed dialogs support the paper's claims. Distillation, latency, and cost numbers are measured rather than estimated. Limitations include English-only coverage, single-hop retrieval, and a focus on knowledge-heavy tasks.

Citations: 9

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

License: Model release follows original LLaMA license; code released in repo (see GitHub)

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 50%

Authors

Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, Monica S. Lam

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Grounding LLM responses in a trusted corpus plus claim-level verification cuts hallucinations dramatically. That reduces misinformation risk, improves user trust, and enables deployment of smaller student models locally for lower cost and better privacy.

Summary TLDR

WikiChat combines Wikipedia retrieval with LLM generation plus claim-by-claim fact-checking. The pipeline yields very high factual accuracy (≈97% with a GPT-4 teacher) on simulated and real conversations, outperforms retrieval-only baselines, and can be distilled into a 7B LLaMA student that remains factual and much faster. The method focuses on knowledge-rich chat, uses time-aware retrieval, and forces the system to say “I don’t know” when evidence is missing.

Problem Statement

LLM chatbots often produce confident but incorrect claims (hallucinations), especially on recent or rare topics. Standard retrieve-then-generate systems still hallucinate, and evaluation benchmarks focus on head (popular) knowledge and miss these weaknesses.

Main Contribution

A practical 7-stage pipeline that: (1) generates queries, (2) retrieves and filters passages, (3) asks an LLM to draft answers, (4) extracts claims, (5) verifies each claim against retrieved evidence, (6) drafts a response from verified facts, and (7) refines the response.

An implementation grounded in English Wikipedia (ColBERTv2 + PLAID for retrieval), run with GPT-4 and GPT-3.5 as the underlying LLMs, and distilled into a 7B-parameter LLaMA student.
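The seven stages can be sketched as a chain of pluggable functions. This is a minimal illustration under stated assumptions, not the authors' code: `Stages`, `run_turn`, and every callable signature are hypothetical, the stubs in the usage example stand in for real LLM and retriever calls, and passing both filtered passages and verified claims to the drafting step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stages:
    """Hypothetical hooks, one per step; each would wrap an LLM or retriever call."""
    make_query: Callable[[List[str]], str]                  # (1) history -> query
    retrieve: Callable[[str], List[str]]                    # (2) query -> passages
    keep_relevant: Callable[[str, List[str]], List[str]]    # (2) filter passages
    draft_answer: Callable[[List[str]], str]                # (3) ungrounded draft
    extract_claims: Callable[[str], List[str]]              # (4) draft -> claims
    is_supported: Callable[[str, List[str]], bool]          # (5) claim vs evidence
    write_response: Callable[[List[str], List[str]], str]   # (6) history + facts
    refine: Callable[[str], str]                            # (7) polish


def run_turn(history: List[str], s: Stages) -> str:
    query = s.make_query(history)
    evidence = s.keep_relevant(query, s.retrieve(query))
    claims = s.extract_claims(s.draft_answer(history))
    # Keep only the claims the retrieved evidence supports.
    verified = [c for c in claims if s.is_supported(c, evidence)]
    facts = evidence + verified  # assumption: both feed the drafting step
    if not facts:
        return "I don't know."   # nothing grounded to say
    return s.refine(s.write_response(history, facts))


# Toy stubs so the sketch runs end to end; real stages would call models.
demo = Stages(
    make_query=lambda h: h[-1],
    retrieve=lambda q: ["Paris is the capital of France."],
    keep_relevant=lambda q, ps: ps,
    draft_answer=lambda h: "Paris is the capital of France. "
                           "Paris has 40 million residents.",
    extract_claims=lambda d: [c.strip() + "." for c in d.split(".") if c.strip()],
    is_supported=lambda c, ev: c in ev,
    write_response=lambda h, facts: " ".join(dict.fromkeys(facts)),
    refine=lambda r: r,
)
print(run_turn(["What is the capital of France?"], demo))
```

In the toy run, the unsupported population claim is dropped during verification and never reaches the final response.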

Key Findings

WikiChat (GPT-4 teacher) achieves high factual accuracy on evaluated conversations.

Numbers: 97.3% factual accuracy (simulated 'All')

Practical Use: If you ground LLM outputs with retrieval + per-claim verification, you can reach ≈97% factuality on open-domain Wikipedia-backed chats. Use this for knowledge-focused assistants.

Evidence Ref: Table 1 (All)

WikiChat outperforms ungrounded GPT-4 strongly on both simulated and real recent-topic conversations.

Numbers: 97.9% vs 42.9% factual (real user study); +55.0 percentage points

Practical Use: For up-to-date or rare queries, prefer a grounded pipeline over vanilla LLM calls to avoid misleading users.

Evidence Ref: Table 3 (user study)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | WikiChat G4 97.3% | GPT-4 66.1% | +31.2pp | Simulated All | Table 1 (All row) | Table 1
Accuracy | WikiChat G4 97.9% | GPT-4 42.9% | +55.0pp | User study (recent topics) | Table 3 and Section 8 | Table 3
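
The Delta column is an absolute difference in percentage points between the two accuracies, not a relative improvement. A minimal sketch of that arithmetic using the values from the rows above (the function name `pp_delta` is illustrative):

```python
def pp_delta(system_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(system_pct - baseline_pct, 1)


simulated_all = pp_delta(97.3, 66.1)  # WikiChat G4 vs GPT-4, simulated "All"
user_study = pp_delta(97.9, 42.9)     # WikiChat G4 vs GPT-4, user study
print(simulated_all, user_study)
```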

What To Try In 7 Days

Prototype a retrieve+claim-verification loop over your company docs: extract claims from LLM drafts and verify against indexed documents.

Run a small head/tail/recent split evaluation on your domain to find blind spots where LLMs hallucinate.

Distill a heavyweight pipeline into a smaller local model by recording teacher inputs/outputs and fine-tuning a 7B model for lower latency and privacy.
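
Recording teacher inputs and outputs for the distillation step could look like the sketch below. It assumes a JSON Lines file and hypothetical field names (`task`, `input`, `output`); the helpers `record_example` and `load_examples` are illustrative. Only the instruction and the actual input are stored, so the student never sees the teacher's few-shot examples.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def record_example(path: Path, task: str, instruction: str,
                   task_input: str, teacher_output: str) -> None:
    """Append one teacher call as a fine-tuning example (one JSON object per line).

    The few-shot examples the teacher saw are deliberately left out, so the
    student is trained on shorter prompts.
    """
    row = {
        "task": task,
        "input": f"{instruction}\n\n{task_input}",
        "output": teacher_output,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_examples(path: Path) -> List[dict]:
    """Read the recorded examples back for fine-tuning."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage: log two subtasks into the same file (multi-task data).
out = Path(tempfile.mkdtemp()) / "teacher_io.jsonl"
record_example(out, "extract_claims",
               "List the factual claims in the reply.",
               "Reply: Paris is the capital of France.",
               "- Paris is the capital of France.")
record_example(out, "verify_claim",
               "Does the evidence support the claim? Answer yes or no.",
               "Claim: Paris is the capital of France.\n"
               "Evidence: Paris is the capital of France.",
               "yes")
examples = load_examples(out)
print(len(examples), "examples recorded")
```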

Agent Features

Memory
Short-term dialog history (last 5 turns); no persistent long-term memory evaluated
Planning
Multi-step response planning via staged generation and refinement
Tool Use
Search/IR (ColBERTv2, PLAID); claim extraction; per-claim verification
Frameworks
In-context learning (few-shot prompts) for each stage; distillation (teacher→student fine-tune on I/O pairs)
Is Agentic

Yes

Architectures
7-stage modular pipeline (query→retrieve→summarize→generate→extract→verify→refine)
Collaboration
Hybrid human+LLM evaluation pipeline (crowdworkers + GPT-4)
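
The short-term memory listed under Memory above (last 5 turns) can be kept with a bounded queue. This is a minimal sketch; `DialogHistory` is a hypothetical helper, and counting each utterance as one turn is an assumption.

```python
from collections import deque


class DialogHistory:
    """Keeps only the most recent turns; older ones are dropped automatically."""

    def __init__(self, max_turns: int = 5):
        self._turns = deque(maxlen=max_turns)

    def add(self, speaker: str, text: str) -> None:
        self._turns.append((speaker, text))

    def as_prompt(self) -> str:
        """Render the retained turns as plain text for an LLM prompt."""
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._turns)


history = DialogHistory()
for i in range(1, 8):  # seven utterances, only the last five survive
    history.add("User" if i % 2 else "Bot", f"message {i}")
print(history.as_prompt())
```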

Optimization Features

Token Efficiency
Student sees no few-shot examples so input length is shorter (context distillation)
Infra Optimization
Local GPU serving for LLaMA models (A100), HuggingFace TGI
Model Optimization
Knowledge distillation to a 7B LLaMA student
System Optimization
Parallelize independent stages (e.g., retrieval and LLM draft steps)
Training Optimization
Multi-task fine-tuning on recorded teacher inputs/outputs (all 7 subtasks); use of user-simulator data generation to scale examples
Inference Optimization
Fuse stages 6-7, remove chain-of-thought, parallelize API calls; use TGI + FlashAttention for faster serving
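
Running independent stages in parallel, as noted under System Optimization, can be sketched with `asyncio.gather`. The two coroutines below are hypothetical stubs that simulate latency with `asyncio.sleep`; in a real system they would wrap the retriever and an LLM API call.

```python
import asyncio
import time
from typing import List, Tuple


async def retrieve_passages(user_turn: str) -> List[str]:
    await asyncio.sleep(0.2)  # stands in for a retriever round trip
    return [f"passage about {user_turn}"]


async def draft_answer(user_turn: str) -> str:
    await asyncio.sleep(0.2)  # stands in for an LLM API call
    return f"draft answer to {user_turn}"


async def first_stages(user_turn: str) -> Tuple[List[str], str]:
    # Neither call needs the other's result, so they can overlap.
    passages, draft = await asyncio.gather(
        retrieve_passages(user_turn),
        draft_answer(user_turn),
    )
    return passages, draft


start = time.perf_counter()
passages, draft = asyncio.run(first_stages("WikiChat"))
elapsed = time.perf_counter() - start
print(f"both stages finished in {elapsed:.2f}s")
```

With the two calls overlapping, wall-clock time should be close to the slower call rather than the sum of both.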

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Model release follows original LLaMA license; code released in repo (see GitHub)

Data URLs

English Wikipedia dump (extracted with WikiExtractor); the authors used the 2023-04-28 snapshot

ColBERTv2 and PLAID retrieval engines (public implementations)

Risks & Boundaries

Limitations

Focused on knowledge-intensive dialogues; not evaluated for task automation or personalized chitchat.

Single-hop retrieval only; multi-hop queries not explored here.

When Not To Use

When the task is creative writing or personal tutoring, or requires the assistant to take initiative beyond factual lookups.

When the required knowledge is not well covered by Wikipedia, or answering requires multi-hop reasoning over multiple documents.

Failure Modes

Missing or incomplete retrieval results cause the system to say 'I don't know' or to omit answers.

Student models (distilled LLaMA) hallucinate more on tail and recent topics than the teacher.

Core Entities

Models

GPT-4; GPT-3.5 (text-davinci-003); LLaMA (student distilled 7B, WikiChat L); Atlas (baseline retrieval model, AtlasXL 3B)

Metrics

Accuracy; Conversationality scores: Relevant, Informational, Natural, Non-Repetitive, Temporal (GPT-4 evaluator); Latency (time per claim / time per turn); Cost per claim

Datasets

English Wikipedia (dump used: 2023-04-28); Wizard of Wikipedia (referenced baseline dataset); KILT tasks (referenced)

Benchmarks

Simulated dialogues over head/tail/recent topic splits (authors' evaluation); User study (real conversations on recent topics)
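
A per-split factuality tally for a head/tail/recent evaluation could be computed as in the sketch below. It assumes each judged claim arrives as a `(split, is_factual)` pair and that the overall score pools all claims; `accuracy_by_split` is a hypothetical helper, and the sample labels are made up for illustration, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_split(judgments: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percent of claims judged factual, per split and pooled under "all"."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, is_factual in judgments:
        for key in (split, "all"):
            total[key] += 1
            correct[key] += int(is_factual)
    return {key: round(100 * correct[key] / total[key], 1) for key in total}


# Made-up labels for illustration only; not data from the paper.
sample = [
    ("head", True), ("head", True), ("head", True),
    ("tail", True), ("tail", False),
    ("recent", False), ("recent", True), ("recent", False),
]
scores = accuracy_by_split(sample)
for split, pct in scores.items():
    print(f"{split}: {pct}%")
```

Reporting the splits separately is what exposes weak spots on tail and recent topics that a single pooled number would hide.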