Teach retrieval-augmented LMs to read and weigh sources by credibility so outputs stay correct under noisy or outdated retrievals

April 10, 2024 · 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.62

Cost Impact Score

0.5

Citation Count

2

Authors

Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, Le Sun

Why It Matters For Business

RAG-powered apps break when retrievers return noisy, outdated, or fake content; training models to use simple credibility labels raises accuracy and resilience without discarding documents.

Summary TLDR

RAG systems break when retrieved documents are noisy, outdated, or fake. This paper introduces CAG (Credibility-aware Generation): a data-transformation + instruction fine-tuning pipeline that (1) tags retrieved text at sentence/document levels with three credibility tiers (high/medium/low), (2) uses LLMs to generate credibility-guided explanations, and (3) fine-tunes models to generate answers that prioritize high-credibility evidence. Across a new benchmark (open-domain, time-sensitive, and misinformation scenarios) CAG models (7B/13B/Mistral-7B) substantially beat vanilla RAG baselines and stay robust as the proportion of noisy documents grows. The approach depends on retriever quality (S

Problem Statement

Retrieval-Augmented Generation helps LLMs but inherits errors from retrieval: noisy, outdated, or fake documents reduce answer correctness. Existing LLMs do not reliably use explicit credibility signals in prompts, so systems need a way to teach models to judge and weigh external evidence by credibility.

Main Contribution

Credibility-aware Generation (CAG): a general RAG design that supplies per-document credibility and trains models to use it.

A data transformation pipeline that (a) labels retrieval units at sentence/document granularity into three credibility tiers and (b) generates credibility-guided explanations via LLM prompts to create instruction-tuning data.

CAGB: a new benchmark covering open-domain QA, time-sensitive QA, and misinformation-polluted QA to measure credibility-aware generation and robustness to noisy retrieval.

Empirical results showing CAG models outperform standard RAG baselines across seven datasets and remain more stable as noise increases.
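A minimal sketch of the labeling and prompting idea described above, assuming each retrieved document carries a normalized relevance score in [0, 1]. The thresholds, the `RetrievedDoc` type, and the bracketed prefix format are hypothetical choices for illustration; the summary does not give the paper's exact cut-offs or prompt wording.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on a normalized relevance score; not from the paper.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.3


@dataclass
class RetrievedDoc:
    text: str
    score: float  # assumed normalized relevance score in [0, 1]


def credibility_tier(score: float) -> str:
    """Map a score to one of the three tiers: high, medium, or low."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= LOW_THRESHOLD:
        return "medium"
    return "low"


def build_prompt(question: str, docs: list[RetrievedDoc]) -> str:
    """Prefix each document with its tier, then append the question."""
    lines = [
        f"[{credibility_tier(d.score).capitalize()} credibility] {d.text}"
        for d in docs
    ]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        RetrievedDoc("Paris is the capital of France.", 0.9),
        RetrievedDoc("Lyon is the capital of France.", 0.1),
    ]
    print(build_prompt("What is the capital of France?", docs))
```

In practice the score could also fold in timestamps or source metadata; a single relevance score is used here only to keep the sketch short.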

Key Findings

CAG-7B raises exact-match (EM) on HotpotQA compared to a LLaMA-2-7B retrieval baseline.

Numbers: HotpotQA EM: LLaMA-2-7B 0.280 -> CAG-7B 0.509 (+0.229)

CAG improves robustness under heavy noise in retrieved context.

Numbers: RGB @ noise 0.8: ChatGPT retrieval 0.773 -> CAG-13B 0.917 (+0.144)

Credibility annotation accuracy strongly limits performance.

Numbers: Using golden credibility labels yields +14.4% average EM for CAG-7B across three datasets vs SPLADE labels

Results

Exact Match (EM) HotpotQA

Value: CAG-7B 0.509

Baseline: LLaMA-2-7B retrieval 0.280

Exact Match (EM) 2WikiMHQA

Value: CAG-7B 0.578

Baseline: LLaMA-2-7B retrieval 0.312

Exact Match (EM) EvolvingTempQA

Value: CAG-7B 0.826

Baseline: LLaMA-2-7B retrieval 0.433

Exact Match (EM) NewsPollutedQA

Value: CAG-7B 0.442

Baseline: LLaMA-2-7B retrieval 0.179

Noise robustness (RGB @ noise 0.8)

Value: CAG-13B 0.917

Baseline: ChatGPT retrieval 0.773
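The absolute gains implied by the results above can be checked with a few lines of arithmetic. This is only a worked calculation over the figures listed in this section; the `gains` helper is a hypothetical name, and the relative factor is derived here rather than reported in the source.

```python
# (baseline, value) pairs copied from the results listed above.
RESULTS = {
    "HotpotQA": (0.280, 0.509),
    "2WikiMHQA": (0.312, 0.578),
    "EvolvingTempQA": (0.433, 0.826),
    "NewsPollutedQA": (0.179, 0.442),
    "RGB @ noise 0.8": (0.773, 0.917),  # CAG-13B vs ChatGPT retrieval
}


def gains(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute gain, value/baseline ratio), rounded for display."""
    return round(value - baseline, 3), round(value / baseline, 2)


if __name__ == "__main__":
    for name, (baseline, value) in RESULTS.items():
        absolute, ratio = gains(baseline, value)
        print(f"{name}: +{absolute:.3f} absolute, {ratio:.2f}x baseline")
```

The HotpotQA and RGB rows reproduce the +0.229 and +0.144 gains quoted under Key Findings.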

Who Should Care

What To Try In 7 Days

Add three-level credibility prefixes (high/medium/low) for retrieved docs and run offline evaluations versus your current pipeline.

Use an LLM (GPT-3.5) to generate short credibility-guided explanations for a small QA subset, then fine-tune a lightweight model for 1-3 epochs.

Measure robustness by injecting distractor or outdated docs and compare EM or other exact-answer metrics before rolling changes to production.
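The robustness check in the last step could be sketched as below. It assumes a SQuAD-style normalized exact match and a pool of known distractor documents; the paper's exact EM definition and noise-injection procedure may differ, and all function names here are hypothetical.

```python
import random
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def inject_noise(relevant, distractors, noise_ratio, total, seed=0):
    """Build a context of up to `total` docs, about `noise_ratio` of them distractors."""
    rng = random.Random(seed)
    n_noise = min(round(total * noise_ratio), len(distractors))
    n_relevant = min(total - n_noise, len(relevant))
    context = rng.sample(distractors, n_noise) + rng.sample(relevant, n_relevant)
    rng.shuffle(context)  # avoid positional cues about which docs are noise
    return context


def em_score(predictions: list[str], gold: list[list[str]]) -> float:
    """Fraction of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return hits / len(predictions)
```

Running the same question set at several noise ratios (for example 0.2, 0.4, 0.6, 0.8) with the current pipeline and with the credibility-prefixed one gives two EM curves to compare before any production change.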

Optimization Features

Training Optimization

  • instruction fine-tuning on credibility-annotated data
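One way to picture a credibility-annotated training example is sketched below. The `instruction`/`input`/`output` field names follow a common instruction-tuning layout and are an assumption, as are the instruction wording, the bracketed prefix format, and the `make_record` and `write_jsonl` helpers; the paper's actual data schema is not given in this summary.

```python
import json

# Hypothetical instruction text; not taken from the paper.
INSTRUCTION = (
    "Answer the question using the documents, giving more weight to "
    "documents with higher credibility."
)


def make_record(question, docs, explanation, answer):
    """Build one training example.

    docs: list of (tier, text) pairs, tier in {"high", "medium", "low"}.
    explanation: credibility-guided explanation, e.g. generated by an LLM.
    """
    context = "\n".join(
        f"[{tier.capitalize()} credibility] {text}" for tier, text in docs
    )
    return {
        "instruction": INSTRUCTION,
        "input": f"{context}\nQuestion: {question}",
        "output": f"{explanation}\nAnswer: {answer}",
    }


def write_jsonl(records, path):
    """Write one JSON object per line, a format most fine-tuning tools accept."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping the explanation in the target output means the model is trained to state which evidence it trusted before giving the answer, which matches the credibility-guided explanation step described earlier.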

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Performance depends on credibility annotation quality and retriever accuracy (SPLADE labels limit gains).
  • Requires extra data generation and fine-tuning resources; not zero-shot.
  • Current credibility scheme uses coarse three-level labels; fine-grained labels were less robust in experiments.

When Not To Use

  • If you lack reliable retrieval signals (timestamps, source metadata) to assign credibility.
  • When you cannot afford additional fine-tuning or labeled explanation generation.
  • If latency/size constraints prevent hosting a tuned model for production.

Failure Modes

  • Incorrect credibility labels (from noisy retriever) can mislead the model and reduce accuracy.
  • Over-reliance on credibility tiers may downweight rare-but-correct low-credibility documents.
  • Mismatched credibility definitions for a domain (e.g., proprietary internal sources) can harm outputs.

Core Entities

Models

  • CAG-7B
  • CAG-13B
  • CAG-mistral-7B
  • LLaMA-2-7B
  • LLaMA-2-13B
  • LLaMA-2-70B
  • Vicuna-7B
  • Mistral-7B-Instruct
  • ChatGPT (gpt-3.5-turbo-0613)

Metrics

  • Exact Match (EM)

Datasets

  • HotpotQA
  • 2WikiMHQA
  • MuSiQue
  • ASQA
  • RGB
  • EvolvingTempQA (time-sensitive)
  • NewsPollutedQA (misinformation-polluted)
  • ShareGPT
  • ELI5
  • QAMPARI
  • WikiQA
  • NewsQA
  • PubMedQA

Benchmarks

  • CAGB (Credibility-aware Generation Benchmark)