Teach retrieval-augmented LMs to read and weigh sources by credibility so outputs stay correct under noisy or outdated retrievals

April 10, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

CAG converts QA data to include credibility labels and LLM-generated explanations, then instruction-fine-tunes models so they learn to prioritize high-credibility text rather than rely on retriever order.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 62%

Authors

Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, Le Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG-powered apps break when retrievers return noisy, outdated, or fake content; training models to use simple credibility labels raises accuracy and resilience without discarding documents.

Who Should Care

Summary TLDR

RAG systems break when retrieved documents are noisy, outdated, or fake. This paper introduces CAG (Credibility-aware Generation): a data-transformation + instruction fine-tuning pipeline that (1) tags retrieved text at sentence/document levels with three credibility tiers (high/medium/low), (2) uses LLMs to generate credibility-guided explanations, and (3) fine-tunes models to generate answers that prioritize high-credibility evidence. Across a new benchmark (open-domain, time-sensitive, and misinformation scenarios) CAG models (7B/13B/Mistral-7B) substantially beat vanilla RAG baselines and stay robust as the proportion of noisy documents grows. The approach depends on retriever quality (S

Problem Statement

Retrieval-Augmented Generation helps LLMs but inherits errors from retrieval: noisy, outdated, or fake documents reduce answer correctness. Existing LLMs do not reliably use explicit credibility signals in prompts, so systems need a way to teach models to judge and weigh external evidence by credibility.

Main Contribution

Credibility-aware Generation (CAG): a general RAG design that supplies per-document credibility and trains models to use it.

A data transformation pipeline that (a) labels retrieval units at sentence/document granularity into three credibility tiers and (b) generates credibility-guided explanations via LLM prompts to create instruction-tuning data.
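The labeling step can be sketched as below. This is a minimal illustration, not the authors' pipeline: the thresholds, the `RetrievedDoc` type, the assumption that a normalized retriever score is the credibility signal, and the `[High credibility]` prefix format are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cut-offs; the source does not state how tiers are separated.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.3


@dataclass
class RetrievedDoc:
    text: str
    score: float  # assumed credibility signal, normalized to [0, 1]


def credibility_tier(score: float) -> str:
    """Map a normalized score to one of three tiers: high, medium, low."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= LOW_THRESHOLD:
        return "medium"
    return "low"


def build_prompt(question: str, docs: List[RetrievedDoc]) -> str:
    """Prefix each document with its tier, then append the question."""
    lines = [
        f"[{credibility_tier(d.score).capitalize()} credibility] {d.text}"
        for d in docs
    ]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        RetrievedDoc("Paris is the capital of France.", 0.92),
        RetrievedDoc("Lyon is the capital of France.", 0.12),
    ]
    print(build_prompt("What is the capital of France?", docs))
```

In a real system the score could also come from timestamps or source metadata; the tier labels are what the fine-tuned model is trained to read.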

Key Findings

CAG-7B raises exact-match (EM) on HotpotQA compared to a LLaMA-2-7B retrieval baseline.

Numbers: HotpotQA EM: LLaMA-2-7B 0.280 -> CAG-7B 0.509 (+0.229)

Practical Use: Fine-tuning with credibility labels lifted HotpotQA EM by about 82% relative to the retrieval baseline (0.280 to 0.509); add credibility-aware training to improve multi-doc reasoning.

Evidence Ref: Table 2 (HotpotQA retrieval-based row and CAG-7B row)

CAG improves robustness under heavy noise in retrieved context.

Numbers: RGB @ noise 0.8: ChatGPT retrieval 0.773 -> CAG-13B 0.917 (+0.144)

Practical Use: When your retriever returns many distractors, CAG-style credibility guidance keeps accuracy higher than reranking or prompt-prefix credibility.

Evidence Ref: Table 11 (RGB noise-robustness results) and Figure 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Exact Match (EM) HotpotQA | CAG-7B 0.509 | LLaMA-2-7B retrieval 0.280 | +0.229 | HotpotQA | Table 2 main results | Table 2
Exact Match (EM) 2WikiMHQA | CAG-7B 0.578 | LLaMA-2-7B retrieval 0.312 | +0.266 | 2WikiMHQA | Table 2 main results | Table 2
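The deltas in these two rows can be checked with a few lines of arithmetic. The sketch below uses only the EM values reported above; the relative-gain figure is derived here and is not a number quoted from the paper, and the function name `deltas` is illustrative.

```python
# EM values copied from the two result rows above.
results = {
    "HotpotQA": {"baseline": 0.280, "cag_7b": 0.509},
    "2WikiMHQA": {"baseline": 0.312, "cag_7b": 0.578},
}


def deltas(baseline: float, value: float):
    """Return (absolute gain, relative gain), each rounded to 3 decimals."""
    absolute = round(value - baseline, 3)
    relative = round((value - baseline) / baseline, 3)
    return absolute, relative


if __name__ == "__main__":
    for name, row in results.items():
        absolute, relative = deltas(row["baseline"], row["cag_7b"])
        print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
    # HotpotQA: +0.229 absolute, +81.8% relative
    # 2WikiMHQA: +0.266 absolute, +85.3% relative
```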

What To Try In 7 Days

Add three-level credibility prefixes (high/medium/low) for retrieved docs and run offline evaluations versus your current pipeline.

Use an LLM (GPT-3.5) to generate short credibility-guided explanations for a small QA subset, then fine-tune a lightweight model for 1-3 epochs.

Measure robustness by injecting distractor or outdated docs and compare EM or other exact-answer metrics before rolling changes to production.
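For the robustness check in the last step, one way to inject distractors at a controlled noise ratio is sketched below. The function name, the fixed seed, and the definition of noise ratio as the share of distractors in the final context are assumptions for illustration, not the benchmark's own procedure.

```python
import random
from typing import List


def inject_noise(
    gold_docs: List[str],
    distractor_pool: List[str],
    noise_ratio: float,
    seed: int = 0,
) -> List[str]:
    """Return a shuffled context in which about `noise_ratio` of docs are distractors."""
    if not 0.0 <= noise_ratio < 1.0:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_noise / (n_gold + n_noise) = noise_ratio for n_noise.
    n_noise = round(len(gold_docs) * noise_ratio / (1.0 - noise_ratio))
    # Cap at the pool size, so the achieved ratio can fall short for small pools.
    n_noise = min(n_noise, len(distractor_pool))
    context = list(gold_docs) + rng.sample(distractor_pool, n_noise)
    rng.shuffle(context)
    return context


if __name__ == "__main__":
    pool = [f"distractor {i}" for i in range(10)]
    for ratio in (0.0, 0.4, 0.8):
        ctx = inject_noise(["gold passage"], pool, ratio)
        print(ratio, len(ctx))
```

Running the current pipeline and the credibility-prefixed pipeline over the same noisy contexts at several ratios gives a before/after curve that can be compared directly.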

Optimization Features

Training Optimization
instruction fine-tuning on credibility-annotated data

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Performance depends on credibility annotation quality and retriever accuracy (SPLADE labels limit gains).

Requires extra data generation and fine-tuning resources; not zero-shot.

When Not To Use

If you lack reliable retrieval signals (timestamps, source metadata) to assign credibility.

When you cannot afford additional fine-tuning or labeled explanation generation.

Failure Modes

Incorrect credibility labels (from noisy retriever) can mislead the model and reduce accuracy.

Over-reliance on credibility tiers may downweight rare-but-correct low-credibility documents.
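To gauge exposure to the first failure mode before deployment, credibility labels can be corrupted on purpose and the drop in accuracy measured. The sketch below is a hypothetical stress-test helper, not something described in the paper; `corrupt_labels` and `flip_rate` are illustrative names.

```python
import random
from typing import List

TIERS = ("high", "medium", "low")


def corrupt_labels(labels: List[str], flip_rate: float, seed: int = 0) -> List[str]:
    """Replace each label with a different tier with probability `flip_rate`."""
    if not 0.0 <= flip_rate <= 1.0:
        raise ValueError("flip_rate must be in [0, 1]")
    rng = random.Random(seed)
    corrupted = []
    for label in labels:
        if rng.random() < flip_rate:
            # Pick uniformly among the two tiers that differ from the original.
            corrupted.append(rng.choice([t for t in TIERS if t != label]))
        else:
            corrupted.append(label)
    return corrupted


if __name__ == "__main__":
    original = ["high", "low", "medium", "high"]
    print(corrupt_labels(original, flip_rate=0.5))
```

Evaluating the model at increasing flip rates shows how quickly answers degrade when the credibility signal itself is wrong.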

Core Entities

Models

CAG-7B, CAG-13B, CAG-mistral-7B, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, Vicuna-7B, Mistral-7B-Instruct, ChatGPT (gpt-3.5-turbo-0613)

Metrics

Exact Match (EM)
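For offline comparisons, EM can be computed as in the sketch below. It follows the common SQuAD-style normalization (lowercase, strip punctuation and English articles, collapse whitespace); whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))


def corpus_em(predictions: List[str], references: List[List[str]]) -> float:
    """Average exact match over a dataset."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = ["The Eiffel Tower.", "Paris, France"]
    golds = [["Eiffel Tower"], ["Paris"]]
    print(corpus_em(preds, golds))  # 0.5
```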

Datasets

HotpotQA, 2WikiMHQA, MuSiQue, ASQA, RGB, EvolvingTempQA (time-sensitive), NewsPollutedQA (misinformation-polluted), ShareGPT, ELI5, QAMPARI, WikiQA, NewsQA, PubMedQA

Benchmarks

CAGB (Credibility-aware Generation Benchmark)