Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.
Summary TLDR
This paper tests three setups for answering policy questions from CDC documents: a standalone LLM (Mistral-7B-Instruct), a Basic RAG (bi-encoder retrieval + context injection), and an Advanced RAG (bi-encoder followed by cross-encoder re-ranking). On a 10-question CDC test set, average faithfulness rises from 0.35 (Vanilla) to 0.62 (Basic RAG) and 0.80 (Advanced RAG). Relevance follows the same pattern (0.45 → 0.70 → 0.80). The main takeaway: a two-stage retrieval pipeline (over-retrieve with a fast bi-encoder, then re-rank candidates with a cross-encoder) materially reduces hallucinations, but document chunking remains a key bottleneck for multi-step policy reasoning.
Problem Statement
LLMs hallucinate when answering policy questions from long, structured government documents. The paper asks whether retrieval techniques, especially cross-encoder re-ranking and different chunking methods, can reliably ground answers in authoritative CDC documents and reduce hallucinations.
Main Contribution
Empirical comparison of Vanilla LLM, Basic RAG, and Advanced RAG on CDC policy Q&A.
A two-stage retrieval pipeline: fast bi-encoder over-retrieval plus cross-encoder re-ranking.
Evaluation of two chunking approaches (recursive character split and token-semantic split) and their impact on faithfulness.
Quantified gains: Basic RAG improves faithfulness and relevance over Vanilla; Advanced RAG improves further with re-ranking.
Qualitative examples showing how re-ranking recovers context-critical answers and reduces hallucinations.
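The two-stage pipeline listed above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the two scorers are toy stand-ins (bag-of-words cosine and bigram overlap) so the example runs without model downloads, and the function names, the sample chunks, and the candidates_k / final_k parameters are assumptions made for illustration. In a real setup the first scorer would presumably be replaced by all-MiniLM-L6-v2 embeddings and the second by the ms-marco-MiniLM-L-6-v2 cross-encoder.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    cleaned = (word.strip(".,;:?!()").lower() for word in text.split())
    return [word for word in cleaned if word]


def toy_bi_encoder(query: str, chunk: str) -> float:
    """Toy stand-in for a bi-encoder: cosine similarity of word counts."""
    q, c = Counter(tokens(query)), Counter(tokens(chunk))
    dot = sum(q[t] * c[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def toy_cross_encoder(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of query bigrams in the chunk."""
    q, c = tokens(query), tokens(chunk)
    q_bigrams = set(zip(q, q[1:]))
    c_bigrams = set(zip(c, c[1:]))
    return len(q_bigrams & c_bigrams) / len(q_bigrams) if q_bigrams else 0.0


def two_stage_retrieve(
    query: str,
    chunks: Sequence[str],
    first_stage: Scorer,
    second_stage: Scorer,
    candidates_k: int = 10,
    final_k: int = 3,
) -> List[Tuple[str, float]]:
    """Over-retrieve with a cheap scorer, then re-rank only the candidates."""
    # Stage 1: score every chunk cheaply and keep the top candidates_k.
    candidates = sorted(chunks, key=lambda ch: first_stage(query, ch), reverse=True)
    candidates = candidates[:candidates_k]
    # Stage 2: apply the expensive scorer to the small candidate set only.
    rescored = [(ch, second_stage(query, ch)) for ch in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


# Hypothetical chunks: the first shares every query word but not its phrasing.
chunks = [
    "Exposure period isolation after.",
    "The isolation period after exposure is defined in the guidance.",
    "Vaccination schedules for adults.",
]
top = two_stage_retrieve(
    "isolation period after exposure",
    chunks,
    toy_bi_encoder,
    toy_cross_encoder,
    candidates_k=2,
    final_k=1,
)
print(top[0][0])
```

In this toy data the word-shuffled distractor wins the first stage, while the re-ranking stage promotes the chunk that matches the query's phrasing. That mirrors, in miniature, the "similar but contextually irrelevant" retrieval problem the paper attributes to bi-encoder-only retrieval.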
Key Findings
Advanced RAG produced the highest grounding quality.
Basic RAG reduces hallucinations versus a standalone LLM but is unstable on some queries.
Re-ranking trades compute for precision: the cross-encoder pass improves the relevance of the retained chunks at the cost of added latency.
Results
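Avg faithfulness: 0.35 (Vanilla) → 0.62 (Basic RAG) → 0.80 (Advanced RAG)
Avg relevance: 0.45 (Vanilla) → 0.70 (Basic RAG) → 0.80 (Advanced RAG)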
Accuracy
Latency tradeoff
Who Should Care
What To Try In 7 Days
Plug a bi-encoder (all-MiniLM-L6-v2) into your index and inject top-k chunks into prompts to check baseline gains.
Add a cross-encoder re-ranker on top of the bi-encoder for a small candidate set (k≈10) and measure faithfulness vs latency.
Run a 10–20 question test set of domain queries to compare faithfulness and spot chunking failures.
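A small harness along these lines is one way to compare setups on a 10–20 question set. It is a sketch under assumptions: evaluate, pipeline, and score_fn are hypothetical names, and score_fn stands for whatever faithfulness judge you choose (human rating, LLM-as-judge, or another method); the summary does not say how the paper computed its scores.

```python
import time
from statistics import mean
from typing import Callable, Dict, Sequence


def evaluate(
    pipeline: Callable[[str], str],
    questions: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Run each question through a pipeline; report mean score and latency."""
    scores, latencies_ms = [], []
    for question in questions:
        start = time.perf_counter()
        answer = pipeline(question)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        # score_fn is a placeholder for a faithfulness (or relevance) judge.
        scores.append(score_fn(question, answer))
    return {"avg_score": mean(scores), "avg_latency_ms": mean(latencies_ms)}
```

Running the same question list through the vanilla, basic, and re-ranked pipelines with one shared score_fn gives directly comparable rows of average score against average latency.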
Optimization Features
Token Efficiency
- Inject only top 3 re-ranked chunks into prompt
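As an illustration of capping the injected context at three chunks, a prompt builder might look like the following. The function name, the numbering scheme, and the instruction wording are assumptions for this sketch, not the paper's prompt.

```python
from typing import Sequence


def build_prompt(question: str, ranked_chunks: Sequence[str], max_chunks: int = 3) -> str:
    """Assemble a grounded prompt from the highest-ranked chunks only."""
    # ranked_chunks is assumed to be sorted best-first by the re-ranker.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(ranked_chunks[:max_chunks], start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```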
System Optimization
- Over-retrieve then filter to balance recall and precision
Inference Optimization
- Limit cross-encoder to top-k candidates to reduce compute
- Use bi-encoder for index-scale retrieval
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Small evaluation: only a 10-question test set limits generality.
- Chunking remains a bottleneck: fragmented documents hurt multi-step reasoning.
- Results are domain-specific to CDC policy documents and may not generalize.
- Cross-encoder re-ranking increases compute and latency, raising deployment cost.
- Data and code are not provided, limiting reproducibility.
When Not To Use
- When strict low-latency requirements preclude re-ranking compute.
- If there is no curated, authoritative corpus to retrieve from.
- When you cannot control or secure the retrieval pipeline for sensitive data.
Failure Modes
- Bi-encoder retrieves semantically similar but contextually irrelevant chunks, causing hallucinations.
- Chunk splitting can break logical policy steps and derail multi-step answers.
- Basic RAG shows volatile performance on edge cases without re-ranking.
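The chunk-splitting failure mode is easy to reproduce. The sketch below contrasts a fixed-width cut with a simplified recursive character split that prefers paragraph, line, sentence, and word boundaries in that order. It is written in the spirit of the recursive character approach named in the paper, but it is not the paper's splitter or any library's implementation; the separator list and function names are assumptions.

```python
from typing import List, Sequence


def fixed_width_split(text: str, max_chars: int) -> List[str]:
    """Naive baseline: cut every max_chars characters, ignoring structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def recursive_split(
    text: str,
    max_chars: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split into pieces of at most max_chars, trying coarse separators first."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        finer = separators[i + 1:]
        chunks: List[str] = []
        current = ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep merging parts while they fit
            else:
                if current:
                    chunks.extend(recursive_split(current, max_chars, finer))
                current = part
        if current:
            chunks.extend(recursive_split(current, max_chars, finer))
        return chunks
    # No separator applies: fall back to a hard cut.
    return fixed_width_split(text, max_chars)
```

On a hypothetical three-step policy text separated by blank lines, with max_chars=40, the fixed-width version cuts through the middle of a step while the recursive version keeps each step whole. Steps longer than the limit are still split, which is why chunking remains a bottleneck for multi-step answers.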
Core Entities
Models
- Mistral-7B-Instruct-v0.2
- all-MiniLM-L6-v2 (bi-encoder)
- ms-marco-MiniLM-L-6-v2 (cross-encoder)
Metrics
- Faithfulness
- Relevance
- MRR@10 (cited concept)
- Latency (ms)
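For reference, MRR@10 averages the reciprocal rank of the first relevant result within the top 10, counting 0 when none appears. The summary lists it as a cited concept rather than a reported result, so the sketch below is only a definition in code; the function name and input layout are assumptions.

```python
from typing import Sequence, Set


def mrr_at_k(
    ranked_ids_per_query: Sequence[Sequence[str]],
    relevant_ids_per_query: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant id within the top k."""
    if not ranked_ids_per_query:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first hit counts
                break
    return total / len(ranked_ids_per_query)
```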
Datasets
- Custom CDC policy corpus (analytical frameworks and guidance)
- Custom 10-question CDC policy QA set
Benchmarks
- Custom 10-question evaluation (faithfulness, relevance)
Context Entities
Models
- Sentence-BERT (cited background)
- MS MARCO re-ranking literature (cited)
Metrics
- MRR@10 (cited for re-ranking benefits)
Benchmarks
- MS MARCO (cited for re-ranking gains)

