Teach retrieval-augmented LMs to read and weigh sources by credibility so outputs stay correct under noisy or outdated retrievals

Overview

Decision Snapshot: Ready For Pilot

CAG converts QA data to include credibility labels and LLM-generated explanations, then instruction-fine-tunes models so they learn to prioritize high-credibility text rather than rely on retriever order.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 62%

Authors

Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, Le Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG-powered apps break when retrievers return noisy, outdated, or fake content; training models to use simple credibility labels raises accuracy and resilience without discarding documents.

Who Should Care

Product Manager, ML Engineer, CTO, Founder, Data Scientist

Summary TLDR

RAG systems break when retrieved documents are noisy, outdated, or fake. This paper introduces CAG (Credibility-aware Generation): a data-transformation + instruction fine-tuning pipeline that (1) tags retrieved text at sentence/document levels with three credibility tiers (high/medium/low), (2) uses LLMs to generate credibility-guided explanations, and (3) fine-tunes models to generate answers that prioritize high-credibility evidence. Across a new benchmark (open-domain, time-sensitive, and misinformation scenarios) CAG models (7B/13B/Mistral-7B) substantially beat vanilla RAG baselines and stay robust as the proportion of noisy documents grows. The approach depends on retriever quality (S

Problem Statement

Retrieval-Augmented Generation helps LLMs but inherits errors from retrieval: noisy, outdated, or fake documents reduce answer correctness. Existing LLMs do not reliably use explicit credibility signals in prompts, so systems need a way to teach models to judge and weigh external evidence by credibility.

Main Contribution

Credibility-aware Generation (CAG): a general RAG design that supplies per-document credibility and trains models to use it.

A data transformation pipeline that (a) labels retrieval units at sentence/document granularity into three credibility tiers and (b) generates credibility-guided explanations via LLM prompts to create instruction-tuning data.
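As a rough illustration of step (a), the sketch below maps a retriever relevance score to one of the three tiers and prefixes each document with its tier before the question. The thresholds, the `RetrievedDoc` type, the score range, and the prompt wording are all hypothetical choices for this example and are not taken from the paper or its code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cut-offs on a relevance score in [0, 1].
# Illustrative assumptions only; tune them on your own data.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.3


@dataclass
class RetrievedDoc:
    text: str
    score: float  # assumed retriever relevance score in [0, 1]


def credibility_tier(score: float) -> str:
    """Map a score to one of three tiers: high, medium, or low."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= LOW_THRESHOLD:
        return "medium"
    return "low"


def build_prompt(question: str, docs: list[RetrievedDoc]) -> str:
    """Prefix each document with its tier, then append the question."""
    lines = [
        f"[{credibility_tier(d.score).capitalize()} credibility] {d.text}"
        for d in docs
    ]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In practice the score could also come from timestamps or source metadata rather than retriever relevance; the tiering function is the only part that would need to change.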

Key Findings

CAG-7B raises exact-match (EM) on HotpotQA compared to a LLaMA-2-7B retrieval baseline.

Numbers: HotpotQA EM: LLaMA-2-7B 0.280 -> CAG-7B 0.509 (+0.229)

Practical Use: Fine-tuning with credibility labels lifts HotpotQA EM to roughly 1.8x the retrieval baseline (0.280 -> 0.509); add credibility-aware training to improve multi-doc reasoning.

Evidence Ref: Table 2 (HotpotQA retrieval-based row and CAG-7B row)

CAG improves robustness under heavy noise in retrieved context.

Numbers: RGB @ noise 0.8: ChatGPT retrieval 0.773 -> CAG-13B 0.917 (+0.144)

Practical Use: When your retriever returns many distractors, CAG-style credibility guidance keeps accuracy higher than reranking or prompt-prefix credibility.

Evidence Ref: Table 11 (RGB noise-robustness results) and Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match (EM) HotpotQA	CAG-7B 0.509	LLaMA-2-7B retrieval 0.280	+0.229	HotpotQA	Table 2 main results	Table 2
Exact Match (EM) 2WikiMHQA	CAG-7B 0.578	LLaMA-2-7B retrieval 0.312	+0.266	2WikiMHQA	Table 2 main results	Table 2
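The Delta column can be re-derived from the Value and Baseline columns. The short check below recomputes the absolute gain and also reports the ratio to the baseline; the `summarize` helper is a name introduced here for illustration, and the numbers are copied from the table rows above.

```python
# (dataset, CAG-7B EM, LLaMA-2-7B retrieval EM), copied from the table above
rows = [
    ("HotpotQA", 0.509, 0.280),
    ("2WikiMHQA", 0.578, 0.312),
]


def summarize(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute delta, ratio to baseline), rounded for display."""
    return round(value - baseline, 3), round(value / baseline, 2)


for name, value, baseline in rows:
    delta, ratio = summarize(value, baseline)
    print(f"{name}: delta={delta:+.3f}, ratio={ratio:.2f}x")
# HotpotQA: delta=+0.229, ratio=1.82x
# 2WikiMHQA: delta=+0.266, ratio=1.85x
```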

What To Try In 7 Days

Add three-level credibility prefixes (high/medium/low) for retrieved docs and run offline evaluations versus your current pipeline.

Use an LLM (GPT-3.5) to generate short credibility-guided explanations for a small QA subset, then fine-tune a lightweight model for 1-3 epochs.

Measure robustness by injecting distractor or outdated docs and compare EM or other exact-answer metrics before rolling changes to production.
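The robustness check in the last step can be sketched as a small offline harness: mix distractors into the context at a chosen noise ratio, call your pipeline, and score exact match. Everything here is an assumption made for the example: the example dictionary keys (`question`, `relevant`, `distractors`, `answers`), the `answer_fn` callable standing in for your RAG pipeline, and the answer normalization, which follows the common lowercase/strip-punctuation/drop-articles convention rather than any scoring script from the paper.

```python
import random
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)


def inject_noise(relevant, distractors, noise_ratio, k, rng):
    """Build a context of up to k docs, about noise_ratio of them distractors."""
    n_noise = min(round(k * noise_ratio), len(distractors))
    n_rel = min(k - n_noise, len(relevant))
    context = rng.sample(distractors, n_noise) + rng.sample(relevant, n_rel)
    rng.shuffle(context)  # avoid leaking relevance through document order
    return context


def evaluate(answer_fn, examples, noise_ratio, k=5, seed=0):
    """Mean exact match of answer_fn(question, context) at a given noise ratio."""
    rng = random.Random(seed)
    hits = 0
    for ex in examples:
        context = inject_noise(ex["relevant"], ex["distractors"], noise_ratio, k, rng)
        hits += exact_match(answer_fn(ex["question"], context), ex["answers"])
    return hits / len(examples)
```

Running `evaluate` for the current pipeline and the credibility-aware variant over a sweep such as 0.2, 0.4, 0.6, 0.8 gives a noise-robustness curve to compare before any production change.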

Optimization Features

Training Optimization

instruction fine-tuning on credibility-annotated data

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/panruotong/CAG

Data URLs

https://github.com/panruotong/CAG

Risks & Boundaries

Limitations

Performance depends on credibility annotation quality and retriever accuracy (SPLADE labels limit gains).

Requires extra data generation and fine-tuning resources; not zero-shot.

When Not To Use

If you lack reliable retrieval signals (timestamps, source metadata) to assign credibility.

When you cannot afford additional fine-tuning or labeled explanation generation.

Failure Modes

Incorrect credibility labels (from noisy retriever) can mislead the model and reduce accuracy.

Over-reliance on credibility tiers may downweight rare-but-correct low-credibility documents.

Core Entities

Models

CAG-7B, CAG-13B, CAG-mistral-7B, LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, Vicuna-7B, Mistral-7B-Instruct, ChatGPT (gpt-3.5-turbo-0613)

Metrics

Exact Match (EM)

Datasets

HotpotQA, 2WikiMHQA, MuSiQue, ASQA, RGB, EvolvingTempQA (time-sensitive), NewsPollutedQA (misinformation-polluted), ShareGPT, ELI5, QAMPARI, WikiQA, NewsQA, PubMedQA

Benchmarks

CAGB (Credibility-aware Generation Benchmark)
