Overview
Results are based on a focused CDC corpus and a small 10-question test; numbers show consistent trends but need larger, diverse evaluations for broader claims.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.
Summary TLDR
This paper tests three setups for answering policy questions from CDC documents: a standalone LLM (Mistral-7B-Instruct, referred to as Vanilla), a Basic RAG (bi-encoder retrieval plus context injection), and an Advanced RAG (bi-encoder retrieval followed by cross-encoder re-ranking). On a 10-question CDC test set, average faithfulness rises from 0.35 (Vanilla) to 0.62 (Basic RAG) and 0.80 (Advanced RAG). Average relevance follows the same pattern (0.45 → 0.70 → 0.80). The main takeaway: a two-stage retrieval pipeline (over-retrieve with a fast bi-encoder, then re-rank the candidates with a cross-encoder) materially reduces hallucinations, but document chunking remains a key bottleneck for multi-step policy reasoning.
Problem Statement
LLMs hallucinate when answering policy questions from long, structured government documents. The paper asks whether retrieval techniques, especially cross-encoder re-ranking and different chunking methods, can reliably ground answers in authoritative CDC documents and reduce hallucinations.
Main Contribution
Empirical comparison of Vanilla LLM, Basic RAG, and Advanced RAG on CDC policy Q&A.
A two-stage retrieval pipeline: fast bi-encoder over-retrieval plus cross-encoder re-ranking.
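The two-stage pipeline named above can be sketched as follows. This is a minimal illustration of the retrieve-then-re-rank control flow, not the paper's implementation: `fast_score` and `precise_score` are hypothetical stand-ins (plain word-overlap heuristics) for a real bi-encoder and cross-encoder, and the parameter names `first_stage_k` and `final_k` are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fast_score(query, chunk):
    """Stand-in for a bi-encoder: cosine similarity of bag-of-words vectors.

    A real bi-encoder would embed query and chunk independently; this
    heuristic only mimics the 'cheap, independent encoding' property.
    """
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0


def precise_score(query, chunk):
    """Stand-in for a cross-encoder: scores the (query, chunk) pair jointly.

    Here: the fraction of query bigrams that also occur in the chunk.
    """
    q, c = tokenize(query), tokenize(chunk)
    q_bigrams = set(zip(q, q[1:]))
    if not q_bigrams:
        return 0.0
    return len(q_bigrams & set(zip(c, c[1:]))) / len(q_bigrams)


def two_stage_retrieve(query, chunks, first_stage_k=10, final_k=3,
                       fast=fast_score, precise=precise_score):
    """Over-retrieve with the fast scorer, then re-rank with the precise one."""
    # Stage 1: cheap scoring over the whole corpus, keep a generous candidate set.
    candidates = sorted(chunks, key=lambda ch: fast(query, ch), reverse=True)
    candidates = candidates[:first_stage_k]
    # Stage 2: expensive scoring, applied only to the surviving candidates.
    reranked = sorted(candidates, key=lambda ch: precise(query, ch), reverse=True)
    return reranked[:final_k]
```

The expensive scorer only ever sees `first_stage_k` chunks, which is where the latency trade-off comes from. Swapping in real models should only require passing different `fast` and `precise` callables.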
Key Findings
Advanced RAG produced the highest grounding quality (average faithfulness 0.80 and relevance 0.80).
Basic RAG reduces hallucinations versus a standalone LLM (average faithfulness 0.62 vs 0.35) but is unstable on some queries.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Avg faithfulness | Vanilla 0.35; Basic RAG 0.62; Advanced RAG 0.80 | Vanilla LLM | Advanced vs Vanilla +129% relative (0.80 vs 0.35) | 10-question CDC policy QA set | Table I shows per-question and average scores | Table I |
| Avg relevance | Vanilla 0.45; Basic RAG 0.70; Advanced RAG 0.80 | Vanilla LLM | Advanced vs Vanilla +78% relative (0.80 vs 0.45) | 10-question CDC policy QA set | Table I averages | Table I |
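The relative deltas in the table follow from (new - baseline) / baseline. A short check using only the averages reported above (the helper name `relative_gain` is illustrative):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Average scores as reported in the results table.
averages = {
    "faithfulness": {"Vanilla": 0.35, "Basic RAG": 0.62, "Advanced RAG": 0.80},
    "relevance": {"Vanilla": 0.45, "Basic RAG": 0.70, "Advanced RAG": 0.80},
}

for metric, scores in averages.items():
    gain = relative_gain(scores["Advanced RAG"], scores["Vanilla"])
    print(f"{metric}: {gain:+.0%}")
# prints:
# faithfulness: +129%
# relevance: +78%
```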
What To Try In 7 Days
Plug a bi-encoder (all-MiniLM-L6-v2) into your index and inject top-k chunks into prompts to check baseline gains.
Add a cross-encoder re-ranker on top of the bi-encoder for a small candidate set (k≈10) and measure faithfulness vs latency.
Run a 10–20 question test set of domain queries to compare faithfulness and spot chunking failures.
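A small harness along these lines could be used to compare setups on a 10 to 20 question set while tracking latency. It is a sketch under stated assumptions: `overlap_faithfulness` is a crude, hypothetical proxy (the share of answer tokens found in the source context), not the metric used in the paper, and `system` is any callable that maps a question to an answer string.

```python
import re
import time
from statistics import mean


def overlap_faithfulness(answer, context):
    """Crude placeholder metric: share of answer tokens that appear in the context.

    Hypothetical proxy for illustration only; replace with a proper
    faithfulness scorer before drawing conclusions.
    """
    answer_tokens = re.findall(r"[a-z0-9]+", answer.lower())
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not answer_tokens:
        return 0.0
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)


def evaluate(system, test_set, score_fn=overlap_faithfulness):
    """Run `system` on (question, context) pairs; report score and latency."""
    scores, latencies = [], []
    for question, context in test_set:
        start = time.perf_counter()
        answer = system(question)
        latencies.append(time.perf_counter() - start)
        scores.append(score_fn(answer, context))
    return {
        "avg_score": mean(scores),
        "avg_latency_s": mean(latencies),
        "per_question": scores,  # inspect low scores for chunking failures
    }
```

Running `evaluate` once per setup (standalone LLM, bi-encoder only, bi-encoder plus re-ranker) on the same test set gives a like-for-like view of score against latency, and the per-question list makes unstable queries easy to spot.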
Risks & Boundaries
Limitations
Small evaluation: only a 10-question test set limits generality.
Chunking remains a bottleneck: fragmented documents hurt multi-step reasoning.
When Not To Use
When strict low-latency requirements preclude re-ranking compute.
If there is no curated, authoritative corpus to retrieve from.
Failure Modes
Bi-encoder retrieves semantically similar but contextually irrelevant chunks, causing hallucinations.
Chunk splitting can break logical policy steps and derail multi-step answers.
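One common mitigation for the chunk-splitting failure mode is to chunk on sentence boundaries with overlap, so that consecutive policy steps share at least one sentence across chunks. The sketch below is an assumption for illustration; the source does not say which chunking methods the paper evaluated, and the regex sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_with_overlap(text, sentences_per_chunk=3, overlap=1):
    """Group sentences into windows that share `overlap` sentences with the previous window."""
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than sentences_per_chunk")
    sentences = split_sentences(text)
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks
```

For example, five sentences with `sentences_per_chunk=3` and `overlap=1` yield two chunks that both contain the third sentence. Larger overlap preserves more cross-step context at the cost of a bigger index.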

