Overview
CAG converts QA data to include credibility labels and LLM-generated explanations, then instruction-fine-tunes models so they learn to prioritize high-credibility text rather than rely on retriever order.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 62%
Why It Matters For Business
RAG-powered apps break when retrievers return noisy, outdated, or fake content; training models to use simple credibility labels raises accuracy and resilience without discarding documents.
Summary TLDR
RAG systems break when retrieved documents are noisy, outdated, or fake. This paper introduces CAG (Credibility-aware Generation): a data-transformation + instruction fine-tuning pipeline that (1) tags retrieved text at sentence/document levels with three credibility tiers (high/medium/low), (2) uses LLMs to generate credibility-guided explanations, and (3) fine-tunes models to generate answers that prioritize high-credibility evidence. Across a new benchmark (open-domain, time-sensitive, and misinformation scenarios) CAG models (7B/13B/Mistral-7B) substantially beat vanilla RAG baselines and stay robust as the proportion of noisy documents grows. The approach depends on retriever quality (S
Problem Statement
Retrieval-Augmented Generation helps LLMs but inherits errors from retrieval: noisy, outdated, or fake documents reduce answer correctness. Existing LLMs do not reliably use explicit credibility signals in prompts, so systems need a way to teach models to judge and weigh external evidence by credibility.
Main Contribution
Credibility-aware Generation (CAG): a general RAG design that supplies per-document credibility and trains models to use it.
A data transformation pipeline that (a) labels retrieval units at sentence/document granularity into three credibility tiers and (b) generates credibility-guided explanations via LLM prompts to create instruction-tuning data.
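The two pipeline steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Doc` class, the numeric `score` field, the 0.33/0.66 cut-offs, the `[high credibility]` prefix format, and the record layout in `make_record` are all hypothetical choices made for this example. The explanation text is assumed to come from a separate LLM call that is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    score: float  # hypothetical credibility score in [0, 1]


def assign_tier(score: float, low_cut: float = 0.33, high_cut: float = 0.66) -> str:
    """Map a score to one of three tiers; the cut-offs are illustrative only."""
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


def build_prompt(question: str, docs: list[Doc]) -> str:
    """Prefix each retrieved document with its credibility tier."""
    lines = [f"[{assign_tier(d.score)} credibility] {d.text}" for d in docs]
    return "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


def make_record(question: str, docs: list[Doc], explanation: str, answer: str) -> dict:
    """Assemble one instruction-tuning example.

    `explanation` stands in for the LLM-generated, credibility-guided
    explanation; generating it is outside the scope of this sketch.
    """
    return {
        "input": build_prompt(question, docs),
        "output": f"{explanation}\nFinal answer: {answer}",
    }
```

Keeping the tier as a short text prefix means no document is discarded; the fine-tuned model is left to decide how much weight each tier deserves.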
Key Findings
CAG-7B raises exact-match (EM) on HotpotQA from 0.280 (LLaMA-2-7B retrieval baseline) to 0.509, a gain of 0.229 (Table 2).
CAG stays robust as the proportion of noisy documents in the retrieved context grows.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Exact Match (EM) HotpotQA | CAG-7B 0.509 | LLaMA-2-7B retrieval 0.280 | +0.229 | HotpotQA | Table 2 main results | Table 2 |
| Exact Match (EM) 2WikiMHQA | CAG-7B 0.578 | LLaMA-2-7B retrieval 0.312 | +0.266 | 2WikiMHQA | Table 2 main results | Table 2 |
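The Delta column is the CAG-7B value minus the baseline value. As a quick check, the short sketch below recomputes it from the two rows above and also derives the relative gain, which the table does not report; the helper name `deltas` is just for illustration.

```python
# (CAG-7B EM, LLaMA-2-7B retrieval EM) copied from the table rows above.
rows = {
    "HotpotQA": (0.509, 0.280),
    "2WikiMHQA": (0.578, 0.312),
}


def deltas(cag: float, baseline: float) -> tuple:
    """Return (absolute EM gain, relative gain in percent)."""
    absolute = round(cag - baseline, 3)
    relative = round(100 * (cag - baseline) / baseline, 1)
    return absolute, relative


for dataset, (cag, base) in rows.items():
    abs_gain, rel_gain = deltas(cag, base)
    print(f"{dataset}: +{abs_gain:.3f} EM absolute, {rel_gain:.1f}% relative")
```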
What To Try In 7 Days
Add three-level credibility prefixes (high/medium/low) for retrieved docs and run offline evaluations versus your current pipeline.
Use an LLM (GPT-3.5) to generate short credibility-guided explanations for a small QA subset, then fine-tune a lightweight model for 1-3 epochs.
Measure robustness by injecting distractor or outdated docs and compare EM or other exact-answer metrics before rolling changes to production.
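For the robustness check in the last step, a small offline harness might look like the sketch below. It is an assumption-laden example: the normalization follows the common SQuAD-style recipe (lowercase, strip punctuation and articles), which may differ from the paper's exact EM scoring, and `inject_noise` with its `noise_ratio` parameter is a hypothetical helper, not something described in the source.

```python
import random
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def inject_noise(docs: list, distractors: list, noise_ratio: float, seed: int = 0) -> list:
    """Mix in distractors so they make up roughly `noise_ratio` of the context.

    The number of distractors is capped by how many are available.
    """
    if not 0 <= noise_ratio < 1:
        raise ValueError("noise_ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_noise = round(len(docs) * noise_ratio / (1 - noise_ratio))
    n_noise = min(n_noise, len(distractors))
    mixed = list(docs) + rng.sample(distractors, n_noise)
    rng.shuffle(mixed)  # avoid leaking which docs are noise via position
    return mixed
```

Running the same question set at several noise ratios (for example 0, 0.5, and 0.8) and comparing EM for the current pipeline against the credibility-prefixed one gives a simple before/after picture.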
Risks & Boundaries
Limitations
Performance depends on credibility annotation quality and retriever accuracy (SPLADE labels limit gains).
Requires extra data generation and fine-tuning resources; not zero-shot.
When Not To Use
If you lack reliable retrieval signals (timestamps, source metadata) to assign credibility.
When you cannot afford additional fine-tuning or labeled explanation generation.
Failure Modes
Incorrect credibility labels (from noisy retriever) can mislead the model and reduce accuracy.
Over-reliance on credibility tiers may downweight rare-but-correct low-credibility documents.

