Overview
Production Readiness
0.6
Novelty Score
0.62
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
RAG-powered apps break when retrievers return noisy, outdated, or fake content; training models to use simple credibility labels raises accuracy and resilience without discarding documents.
Summary TLDR
RAG systems break when retrieved documents are noisy, outdated, or fake. This paper introduces CAG (Credibility-aware Generation): a pipeline combining data transformation and instruction fine-tuning that (1) tags retrieved text at the sentence and document level with one of three credibility tiers (high/medium/low), (2) uses LLMs to generate credibility-guided explanations, and (3) fine-tunes models to generate answers that prioritize high-credibility evidence. On a new benchmark covering open-domain, time-sensitive, and misinformation scenarios, CAG models (7B/13B/Mistral-7B) substantially beat vanilla RAG baselines and stay robust as the proportion of noisy documents grows. The approach depends on retriever quality (S
Problem Statement
Retrieval-Augmented Generation helps LLMs but inherits errors from retrieval: noisy, outdated, or fake documents reduce answer correctness. Existing LLMs do not reliably use explicit credibility signals in prompts, so systems need a way to teach models to judge and weigh external evidence by credibility.
Main Contribution
- Credibility-aware Generation (CAG): a general RAG design that supplies per-document credibility and trains models to use it.
- A data transformation pipeline that (a) labels retrieval units at sentence/document granularity into three credibility tiers and (b) generates credibility-guided explanations via LLM prompts to create instruction-tuning data.
- CAGB: a new benchmark covering open-domain QA, time-sensitive QA, and misinformation-polluted QA to measure credibility-aware generation and robustness to noisy retrieval.
- Empirical results showing CAG models outperform standard RAG baselines across seven datasets and remain more stable as noise increases.
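The labeling and prompting idea in the contributions above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `RetrievedDoc` type, the two score thresholds, and the `[credibility: ...]` prefix format are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on a relevance score normalized to [0, 1].
# They are illustrative only and would need tuning per retriever.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.3


@dataclass
class RetrievedDoc:
    text: str
    score: float  # assumed: retriever relevance score normalized to [0, 1]


def credibility_tier(score: float) -> str:
    """Map a normalized score to one of three tiers: high, medium, or low."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= LOW_THRESHOLD:
        return "medium"
    return "low"


def build_prompt(question: str, docs: list[RetrievedDoc]) -> str:
    """Prefix each document with its tier, then append the question."""
    lines = [f"[credibility: {credibility_tier(d.score)}] {d.text}" for d in docs]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        RetrievedDoc("Paris is the capital of France.", 0.92),
        RetrievedDoc("Some say Lyon is the capital of France.", 0.15),
    ]
    print(build_prompt("What is the capital of France?", docs))
```

Other signals mentioned in this summary, such as timestamps or source metadata, could feed `credibility_tier` in place of the relevance score; the three-tier output would stay the same.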
Key Findings
- CAG-7B raises exact-match (EM) on HotpotQA compared to a LLaMA-2-7B retrieval baseline.
- CAG improves robustness under heavy noise in retrieved context.
- Credibility annotation accuracy strongly limits performance.
Results
- Exact Match (EM), HotpotQA
- Exact Match (EM), 2WikiMHQA
- Exact Match (EM), EvolvingTempQA
- Exact Match (EM), NewsPollutedQA
- Noise robustness (RGB @ noise 0.8)
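For readers unfamiliar with the metric, Exact Match can be computed along these lines. This sketch assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the paper's exact normalization is not stated in this summary, so treat the details as an assumption.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

EM is strict: a correct answer phrased differently from every reference counts as a miss, which is worth keeping in mind when comparing systems that produce longer, explanation-style outputs.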
Who Should Care
What To Try In 7 Days
- Add three-level credibility prefixes (high/medium/low) to retrieved docs and run offline evaluations against your current pipeline.
- Use an LLM (e.g., GPT-3.5) to generate short credibility-guided explanations for a small QA subset, then fine-tune a lightweight model for 1-3 epochs.
- Measure robustness by injecting distractor or outdated docs, and compare EM or other exact-answer metrics before rolling changes out to production.
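The noise-injection check in the last step might look something like the sketch below. It is an assumed harness, not the paper's evaluation code: `inject_noise`, `sweep`, the example dictionary keys, and the `answer_fn` callable (your own RAG pipeline) are hypothetical names, and the comparison is a simple case-insensitive string match.

```python
import random
from typing import Callable


def inject_noise(relevant: list[str], distractors: list[str],
                 noise_ratio: float, k: int, seed: int = 0) -> list[str]:
    """Build a context of up to k docs with about noise_ratio distractors."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_noise = min(round(k * noise_ratio), len(distractors))
    n_clean = min(k - n_noise, len(relevant))
    context = rng.sample(distractors, n_noise) + rng.sample(relevant, n_clean)
    rng.shuffle(context)  # avoid a position cue that separates clean from noisy docs
    return context


def sweep(examples: list[dict], answer_fn: Callable[[str, list[str]], str],
          ratios: list[float], k: int = 5) -> dict[float, float]:
    """Return accuracy per noise ratio for a given answering function."""
    results = {}
    for ratio in ratios:
        hits = 0
        for ex in examples:
            ctx = inject_noise(ex["relevant"], ex["distractors"], ratio, k)
            pred = answer_fn(ex["question"], ctx)
            hits += pred.strip().lower() == ex["answer"].strip().lower()
        results[ratio] = hits / len(examples)
    return results
```

Running `sweep` once with the current pipeline and once with the credibility-prefixed variant, over the same ratios and seed, gives a like-for-like robustness curve.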
Optimization Features
Training Optimization
- instruction fine-tuning on credibility-annotated data
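One way to picture the credibility-annotated training data is as instruction-tuning records like the ones built below. The field names (`instruction`, `input`, `output`), the instruction wording, and the prefix format are assumptions for illustration; the summary does not specify the authors' actual schema.

```python
import json

VALID_TIERS = {"high", "medium", "low"}

# Hypothetical instruction text; the wording is an assumption.
INSTRUCTION = (
    "Answer the question using the documents. "
    "Prioritize documents with higher credibility."
)


def make_record(question: str, docs: list[tuple[str, str]],
                explanation: str, answer: str) -> dict:
    """Build one training record from (tier, text) pairs and a target output."""
    for tier, _ in docs:
        if tier not in VALID_TIERS:
            raise ValueError(f"unknown credibility tier: {tier!r}")
    context = "\n".join(f"[credibility: {tier}] {text}" for tier, text in docs)
    return {
        "instruction": INSTRUCTION,
        "input": f"{context}\nQuestion: {question}",
        # The target pairs a credibility-guided explanation with the final answer.
        "output": f"{explanation}\nAnswer: {answer}",
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The resulting JSONL can be fed to most instruction fine-tuning scripts after mapping the three fields onto whatever prompt template the base model expects.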
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Performance depends on credibility annotation quality and retriever accuracy (SPLADE labels limit gains).
- Requires extra data generation and fine-tuning resources; not zero-shot.
- Current credibility scheme uses coarse three-level labels; fine-grained labels were less robust in experiments.
When Not To Use
- If you lack reliable retrieval signals (timestamps, source metadata) to assign credibility.
- If you cannot afford additional fine-tuning or the generation of labeled explanations.
- If latency/size constraints prevent hosting a tuned model for production.
Failure Modes
- Incorrect credibility labels (from noisy retriever) can mislead the model and reduce accuracy.
- Over-reliance on credibility tiers may downweight rare-but-correct low-credibility documents.
- Mismatched credibility definitions for a domain (e.g., proprietary internal sources) can harm outputs.
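The first failure mode can be probed before deployment by deliberately corrupting labels and re-running an offline evaluation. The helper below is a hypothetical sketch of that stress test, not something described in the source; `corrupt_labels` and `flip_prob` are names invented for this example.

```python
import random

TIERS = ("high", "medium", "low")


def corrupt_labels(labels: list[str], flip_prob: float, seed: int = 0) -> list[str]:
    """Replace each label with a different tier with probability flip_prob."""
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be in [0, 1]")
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if label not in TIERS:
            raise ValueError(f"unknown credibility tier: {label!r}")
        if rng.random() < flip_prob:
            # Pick uniformly among the two other tiers.
            noisy.append(rng.choice([t for t in TIERS if t != label]))
        else:
            noisy.append(label)
    return noisy
```

Evaluating the tuned model at a few values of `flip_prob` shows how quickly accuracy degrades as annotation quality drops, which helps decide how much to invest in the labeling step.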
Core Entities
Models
- CAG-7B
- CAG-13B
- CAG-mistral-7B
- LLaMA-2-7B
- LLaMA-2-13B
- LLaMA-2-70B
- Vicuna-7B
- Mistral-7B-Instruct
- ChatGPT (gpt-3.5-turbo-0613)
Metrics
- Exact Match (EM)
Datasets
- HotpotQA
- 2WikiMHQA
- MuSiQue
- ASQA
- RGB
- EvolvingTempQA (time-sensitive)
- NewsPollutedQA (misinformation-polluted)
- ShareGPT
- ELI5
- QAMPARI
- WikiQA
- NewsQA
- PubMedQA
Benchmarks
- CAGB (Credibility-aware Generation Benchmark)

