A retrieval+claim-verification pipeline that cuts hallucinations and can be distilled to a fast 7B model

May 23, 2023 · 9 min

Overview

Production Readiness

0.8

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

9

Authors

Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, Monica S. Lam

Why It Matters For Business

Grounding LLM responses in a trusted corpus plus claim-level verification cuts hallucinations dramatically. That reduces misinformation risk, improves user trust, and enables deployment of smaller student models locally for lower cost and better privacy.

Summary TLDR

WikiChat combines Wikipedia retrieval with LLM generation and claim-by-claim fact-checking. The pipeline reaches about 97% factual accuracy with a GPT-4 teacher on simulated and real conversations, outperforms retrieval-only baselines, and can be distilled into a 7B LLaMA student that stays factual (91.1%) while running about 3.2× faster per claim. The method targets knowledge-rich chat, uses time-aware retrieval, and forces the system to say “I don’t know” when evidence is missing.

Problem Statement

LLM chatbots often produce confident but incorrect claims (hallucinations), especially on recent or rare topics. Standard retrieve-then-generate systems still hallucinate, and evaluation benchmarks focus on head (popular) knowledge and miss these weaknesses.

Main Contribution

A practical 7-stage pipeline that: (1) generates queries, (2) retrieves and filters passages, (3) asks an LLM to draft answers, (4) extracts claims, (5) verifies each claim against retrieved evidence, (6) drafts a response from verified facts, and (7) refines the response.

An implementation grounded in English Wikipedia (ColBERTv2 + PLAID for retrieval), run with GPT-4 and GPT-3.5, and distilled into a 7B-parameter LLaMA student.

A human-and-LLM hybrid evaluation method: per-claim human fact-checking plus GPT-4 scoring for conversationality metrics, applied on head/tail/recent topic splits.

Empirical results showing large factual gains vs base LLMs and retrieval baselines, and a successful distillation that reduces latency and cost.
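
The seven stages listed above can be sketched as a chain of function calls. This is a minimal illustration, not the authors' code: `llm` and `retrieve` are hypothetical caller-supplied callables, the one-word prompt tags stand in for real few-shot prompts, and retrieving fresh evidence for each claim in stage 5 is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]              # hypothetical: prompt in, text out
Retriever = Callable[[str], List[str]]  # hypothetical: query in, passages out


@dataclass
class Turn:
    verified_claims: List[str]
    evidence: List[str]
    response: str


def run_pipeline(history: List[str], llm: LLM, retrieve: Retriever) -> Turn:
    dialog = "\n".join(history)
    # (1) generate a search query from the dialog so far
    query = llm(f"QUERY\n{dialog}")
    # (2) retrieve passages and keep only those the LLM judges relevant
    passages = [p for p in retrieve(query)
                if llm(f"RELEVANT\n{dialog}\n{p}").strip().lower() == "yes"]
    # (3) ask the LLM for an ungrounded draft answer
    draft = llm(f"DRAFT\n{dialog}")
    # (4) split the draft into individual claims, one per line
    claim_text = llm(f"CLAIMS\n{draft}")
    claims = [c.strip() for c in claim_text.splitlines() if c.strip()]
    # (5) keep a claim only if retrieved evidence supports it
    verified = []
    for claim in claims:
        support = "\n".join(retrieve(claim))
        verdict = llm(f"VERIFY\n{claim}\n{support}")
        if verdict.strip().lower() == "supported":
            verified.append(claim)
    # With no usable evidence at all, abstain instead of guessing
    if not passages and not verified:
        return Turn([], [], "I don't know.")
    # (6) draft a response from the filtered passages and verified claims only
    facts = "\n".join(passages + verified)
    answer = llm(f"RESPOND\n{dialog}\n{facts}")
    # (7) refine the draft response
    response = llm(f"REFINE\n{answer}")
    return Turn(verified, passages, response)
```

The point of the structure is that the ungrounded draft from stage 3 never reaches the user directly: only claims that survive stage 5 and passages that survive stage 2 feed the final response.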

Key Findings

WikiChat (GPT-4 teacher) achieves high factual accuracy on evaluated conversations.

Numbers: 97.3% factual accuracy (simulated 'All')

WikiChat substantially outperforms ungrounded GPT-4 on both simulated and real recent-topic conversations.

Numbers: 97.9% vs 42.9% factual (real user study); +55.0 percentage points

A distilled 7B LLaMA student retains high factuality while reducing latency.

Numbers: WikiChat L: 91.1% factual (simulated All); per-claim latency 2.3s vs teacher 7.4s (≈3.2× speedup)

The method improves factuality consistently across underlying LLMs.

Numbers: Average factuality gains over base LLM: GPT-4 +31.2pp, GPT-3.5 +27.8pp, LLaMA +50.2pp (on evaluated sets)

Base LLM factuality falls sharply on tail and recent topics, exposing benchmark blind spots.

Numbers: GPT-4 drops ~38.9pp on tail and ~47.4pp on recent vs head

A substantial fraction of LLM-generated claims are removed by verification.

Numbers: ~33% of LLM claims rejected on average; higher rejection for tail/recent subsets

Results

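Accuracy (simulated conversations, All)

Value: WikiChat G4 97.3%

Baseline: GPT-4 66.1%

Accuracy (real user study, recent topics)

Value: WikiChat G4 97.9%

Baseline: GPT-4 42.9%

Accuracy (distilled student, simulated conversations, All)

Value: WikiChat L 91.1%

Baseline: Teacher WikiChat G4 97.3%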

Per-claim latency

Value: WikiChat G4 7.4s, WikiChat L 2.3s

Baseline: teacher (WikiChat G4) vs student (WikiChat L)

Cost per claim

Value: WikiChat G4 19.6¢, GPT-4 0.7¢

Baseline: GPT-4 single-call baseline

Retrieval vs LLM contribution to final claims

Value: About 27-32% of final claims come from verified LLM outputs; the rest come from IR
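
The trade-offs behind these figures are easier to compare as differences and ratios. The short calculation below uses only numbers reported in this summary; the variable names are illustrative.

```python
# Figures copied from the results reported in this summary.
teacher_acc, base_acc = 97.3, 66.1            # % factual, simulated conversations (All)
student_acc = 91.1                            # % factual, distilled 7B student
teacher_latency, student_latency = 7.4, 2.3   # seconds per claim
teacher_cost, base_cost = 19.6, 0.7           # cents per claim

gain_pp = round(teacher_acc - base_acc, 1)             # accuracy gain over GPT-4
distill_loss_pp = round(teacher_acc - student_acc, 1)  # accuracy given up by distilling
speedup = round(teacher_latency / student_latency, 1)  # student vs teacher latency
cost_ratio = round(teacher_cost / base_cost, 1)        # pipeline vs single GPT-4 call

print(f"gain over GPT-4:       {gain_pp} pp")
print(f"lost to distillation:  {distill_loss_pp} pp")
print(f"student speedup:       {speedup}x")
print(f"teacher cost multiple: {cost_ratio}x")
```

Read together: the GPT-4 pipeline buys 31.2 points of accuracy at roughly 28 times the per-claim cost of a single GPT-4 call, while distillation gives back 6.2 points in exchange for a 3.2× latency improvement.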

What To Try In 7 Days

Prototype a retrieve+claim-verification loop over your company docs: extract claims from LLM drafts and verify against indexed documents.

Run a small head/tail/recent split evaluation on your domain to find blind spots where LLMs hallucinate.

Distill a heavyweight pipeline into a smaller local model by recording teacher inputs/outputs and fine-tuning a 7B model for lower latency and privacy.
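
For the head/tail/recent evaluation suggested above, a small script is enough to bucket topics and report per-split accuracy. The sketch below is an assumption-laden starting point: this summary does not say how the authors defined the splits, so the popularity threshold, the cutoff date, and both function names are hypothetical and should be adapted to your own corpus.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def assign_split(views: int, created: date,
                 cutoff: date = date(2023, 1, 1),        # hypothetical knowledge cutoff
                 head_min_views: int = 100_000) -> str:  # hypothetical popularity threshold
    """Bucket a topic as recent (newer than the cutoff), head (popular), or tail."""
    if created >= cutoff:
        return "recent"
    return "head" if views >= head_min_views else "tail"


def accuracy_by_split(labels: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """labels: (split, claim_is_supported) pairs from per-claim fact-checking."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for split, ok in labels:
        totals[split] += 1
        correct[split] += int(ok)
    return {split: correct[split] / totals[split] for split in totals}
```

A large gap between the head score and the tail or recent scores is the signal to look for; it marks the topics where an ungrounded model is most likely to hallucinate.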

Agent Features

Memory

  • Short-term dialog history (last 5 turns)
  • No persistent long-term memory evaluated
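
A bounded buffer is one simple way to keep only the most recent dialog turns. In this sketch the class name is hypothetical, and treating each utterance as one "turn" is an assumption; the summary does not say whether a turn means a single utterance or a user/agent exchange.

```python
from collections import deque
from typing import Deque, Tuple


class DialogHistory:
    """Keeps only the most recent turns; older ones are dropped automatically."""

    def __init__(self, max_turns: int = 5):
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, speaker: str, text: str) -> None:
        self._turns.append((speaker, text))

    def render(self) -> str:
        # Flatten the retained turns into the text block passed to each prompt
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._turns)
```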

Planning

  • Multi-step response planning via staged generation and refinement

Tool Use

  • Search/IR (ColBERTv2, PLAID)
  • Claim extraction
  • Per-claim verification

Frameworks

  • In-context learning (few-shot prompts) for each stage
  • Distillation (teacher→student fine-tune on I/O pairs)

Is Agentic

true

Architectures

  • 7-stage modular pipeline (query→retrieve/filter→generate→extract claims→verify→draft→refine)

Collaboration

  • Hybrid human+LLM evaluation pipeline (crowdworkers + GPT-4)

Optimization Features

Token Efficiency

  • Student sees no few-shot examples so input length is shorter (context distillation)

Infra Optimization

  • Local GPU serving for LLaMA models (A100), HuggingFace TGI

Model Optimization

  • Knowledge distillation to a 7B LLaMA student

System Optimization

  • Parallelize independent stages (e.g., retrieval and LLM draft steps)
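
Because the retrieval branch and the LLM draft branch do not depend on each other, they can run concurrently. The sketch below uses `asyncio.gather`; both coroutines are hypothetical placeholders in which `asyncio.sleep` stands in for real network calls.

```python
import asyncio
from typing import List, Tuple


async def retrieve_branch(dialog: str) -> List[str]:
    await asyncio.sleep(0.05)  # placeholder for query generation + retrieval
    return ["passage"]


async def draft_branch(dialog: str) -> List[str]:
    await asyncio.sleep(0.05)  # placeholder for LLM draft + claim extraction
    return ["claim"]


async def gather_evidence(dialog: str) -> Tuple[List[str], List[str]]:
    # Both branches start immediately; total wait is roughly the slower branch,
    # not the sum of the two.
    passages, claims = await asyncio.gather(
        retrieve_branch(dialog), draft_branch(dialog)
    )
    return passages, claims


if __name__ == "__main__":
    print(asyncio.run(gather_evidence("User: hello")))
```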

Training Optimization

  • Multi-task fine-tuning on recorded teacher inputs/outputs (all 7 subtasks)
  • Use of user-simulator data generation to scale examples
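
Recording teacher inputs and outputs for fine-tuning might look like the sketch below. It assumes a plain prompt/completion JSONL layout, which is a common fine-tuning format rather than anything confirmed by the source; the function and field names are hypothetical. The few-shot examples the teacher saw are deliberately left out of the student prompt, which is what keeps the student's input short.

```python
import json
from typing import Dict, Iterable, List


def make_student_example(stage: str, instruction: str, few_shot: List[str],
                         task_input: str, teacher_output: str) -> Dict[str, str]:
    """Turn one recorded teacher call into one student training example.

    The teacher prompt was instruction + few_shot + task_input; the student
    prompt keeps only instruction + task_input, so few_shot is unused here.
    """
    return {
        "stage": stage,
        "prompt": f"{instruction}\n\n{task_input}",
        "completion": teacher_output,
    }


def to_jsonl(examples: Iterable[Dict[str, str]]) -> str:
    # One JSON object per line; examples from all stages share one file
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

Mixing examples from every stage in one file is what makes the fine-tune multi-task: a single student model learns all of the subtasks.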

Inference Optimization

  • Fuse stages 6-7, remove chain-of-thought, parallelize API calls
  • Use TGI + FlashAttention for faster serving

Reproducibility

License

  • Model release follows original LLaMA license; code released in repo (see GitHub)

Data Urls

  • English Wikipedia dump used (extracted with WikiExtractor) — authors used 2023-04-28 snapshot
  • ColBERTv2 and PLAID retrieval engines (public implementations)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focused on knowledge-intensive dialogues; not evaluated for task automation or personalized chitchat.
  • Single-hop retrieval only; multi-hop queries not explored here.
  • Tested only on English Wikipedia; specialized domains (medical, legal) not evaluated.
  • Teacher pipeline (WikiChat G4) has high latency and cost; requires distillation for production.

When Not To Use

  • When the task is creative writing, personal tutoring, or anything that requires the assistant to take initiative beyond factual lookups.
  • When the knowledge source is not well covered in Wikipedia or requires multi-hop reasoning over multiple documents.
  • When you need ultra-low latency from cloud LLMs without local serving; the teacher pipeline is slow.

Failure Modes

  • Missing or incomplete retrieval results cause the system to say 'I don't know' or to omit answers.
  • Student models (distilled LLaMA) hallucinate more on tail and recent topics than the teacher.
  • Errors arise when the indexed corpus is outdated or missing domain-specific facts.
  • Higher runtime cost for teacher pipeline due to many LLM calls.

Core Entities

Models

  • GPT-4
  • GPT-3.5 (text-davinci-003)
  • LLaMA (student distilled 7B, WikiChat L)
  • Atlas (baseline retrieval model, AtlasXL 3B)

Metrics

  • Accuracy
  • Conversationality scores: Relevant, Informational, Natural, Non-Repetitive, Temporal (GPT-4 evaluator)
  • Latency (time per claim / time per turn)
  • Cost per claim

Datasets

  • English Wikipedia (dump used: 2023-04-28)
  • Wizard of Wikipedia (referenced baseline dataset)
  • KILT tasks (referenced)

Benchmarks

  • Simulated dialogues over head/tail/recent topic splits (authors' evaluation)
  • User study (real conversations on recent topics)