Cross-encoder re-ranking boosts faithfulness of RAG for CDC policy Q&A

Overview

Decision Snapshot: Needs Validation

Results are based on a focused CDC corpus and a small 10-question test; numbers show consistent trends but need larger, diverse evaluations for broader claims.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Anuj Maharjan, Umesh Yadav

Links

Abstract / PDF

Why It Matters For Business

Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

This paper tests three setups for answering policy questions from CDC documents: a standalone LLM (Mistral-7B-Instruct), a Basic RAG (bi-encoder retrieval + context injection), and an Advanced RAG (bi-encoder retrieval followed by cross-encoder re-ranking). On a 10-question CDC test set, average faithfulness rises from 0.35 (Vanilla) to 0.62 (Basic RAG) and 0.80 (Advanced RAG). Relevance follows the same pattern (0.45 → 0.70 → 0.80). The main takeaway: a two-stage retrieval pipeline (over-retrieve with a fast bi-encoder, then re-rank the candidates with a cross-encoder) substantially reduces hallucinations on this test set, but document chunking remains a key bottleneck for multi-step policy reasoning.

Problem Statement

LLMs hallucinate when answering policy questions from long, structured government documents. The paper asks whether retrieval techniques—especially cross-encoder re-ranking and different chunking methods—can reliably ground answers in authoritative CDC documents and reduce hallucinations.

Main Contribution

Empirical comparison of Vanilla LLM, Basic RAG, and Advanced RAG on CDC policy Q&A.

A two-stage retrieval pipeline: fast bi-encoder over-retrieval plus cross-encoder re-ranking.
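The two-stage idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: `two_stage_retrieve`, `overlap_score`, and `phrase_score` are hypothetical names, and the two scorers are toy token-based stand-ins so the example runs without model downloads. In a real pipeline the cheap scorer would presumably be bi-encoder embedding similarity (all-MiniLM-L6-v2) and the precise scorer a cross-encoder (ms-marco-MiniLM-L-6-v2). The defaults of 10 candidates and 3 final chunks follow the values mentioned elsewhere in this summary.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    chunks: Sequence[str],
    fast_score: Scorer,
    precise_score: Scorer,
    k_candidates: int = 10,
    k_final: int = 3,
) -> List[Tuple[str, float]]:
    """Over-retrieve with a cheap scorer, then re-rank the shortlist with a costly one."""
    # Stage 1: score every chunk cheaply and keep a generous shortlist.
    shortlist = sorted(chunks, key=lambda c: fast_score(query, c), reverse=True)[:k_candidates]
    # Stage 2: run the expensive scorer only on the shortlist.
    reranked = sorted(
        ((c, precise_score(query, c)) for c in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:k_final]


# Toy stand-ins (assumptions for illustration, NOT the paper's models).
def _tokens(text: str) -> List[str]:
    return [t.strip(".,:;?!").lower() for t in text.split()]


def overlap_score(query: str, chunk: str) -> float:
    """Cheap stand-in for bi-encoder similarity: number of shared unique tokens."""
    return float(len(set(_tokens(query)) & set(_tokens(chunk))))


def phrase_score(query: str, chunk: str) -> float:
    """Costlier stand-in for a cross-encoder: query bigrams that appear in the chunk."""
    q, c = _tokens(query), _tokens(chunk)
    chunk_bigrams = set(zip(c, c[1:]))
    return float(sum(1 for bigram in zip(q, q[1:]) if bigram in chunk_bigrams))


chunks = [
    "The analysis of problem steps in unrelated budget policy.",
    "Policy analysis steps: identify the problem, then assess options.",
    "Vaccination schedules are updated yearly.",
]
top = two_stage_retrieve(
    "policy analysis steps", chunks, overlap_score, phrase_score,
    k_candidates=2, k_final=1,
)
```

Here the first two chunks tie under the cheap scorer because both share the same three words with the query, so the first stage alone cannot separate them. The second stage promotes the chunk that actually contains the query's phrasing, which is the kind of similar-but-irrelevant confusion the re-ranker is meant to resolve.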

Key Findings

Advanced RAG produced the highest grounding quality.

Numbers: Avg faithfulness: Vanilla 0.35 → Basic 0.62 → Advanced 0.80 (Table I)

Practical Use: Use a two-stage retrieval pipeline (bi-encoder + cross-encoder) to markedly reduce hallucinations on policy QA.

Evidence Ref: Table I, Fig. 2

Basic RAG reduces hallucinations versus a standalone LLM but is unstable on some queries.

Numbers: Basic RAG improved faithfulness by ~79% and relevance by ~55% over Vanilla (per paper text)

Practical Use: If compute for cross-encoders is limited, Basic RAG still helps, but expect volatility on edge or highly specific queries.

Evidence Ref: Section IV.C and Table I

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Avg faithfulness	Vanilla 0.35, Basic RAG 0.62, Advanced RAG 0.80	Vanilla LLM	Advanced vs Vanilla +129% relative (0.80 vs 0.35)	10-question CDC policy QA set	Table I shows per-question and average scores	Table I
Avg relevance	Vanilla 0.45, Basic RAG 0.70, Advanced RAG 0.80	Vanilla LLM	Advanced vs Vanilla +78% relative (0.80 vs 0.45)	10-question CDC policy QA set	Table I averages	Table I
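The relative deltas in the table can be recomputed from the reported averages. The snippet below is a small arithmetic check; `relative_gain` is a hypothetical helper name, and the inputs are the rounded averages quoted above rather than the per-question scores, which are not reproduced in this summary.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Rounded averages as reported in the table above.
averages = {
    "faithfulness": {"vanilla": 0.35, "basic": 0.62, "advanced": 0.80},
    "relevance": {"vanilla": 0.45, "basic": 0.70, "advanced": 0.80},
}

# Gain of each RAG variant over the Vanilla baseline, per metric.
gains = {
    metric: {
        system: round(relative_gain(score, scores["vanilla"]), 1)
        for system, score in scores.items()
        if system != "vanilla"
    }
    for metric, scores in averages.items()
}
```

With these inputs the Advanced RAG gains come out at about 128.6% (faithfulness) and 77.8% (relevance), matching the +129% and +78% in the table. The Basic RAG gains come out at about 77.1% and 55.6%. The relevance figure matches the ~55% quoted from the paper text, but the faithfulness figure is slightly lower than the quoted ~79%. This summary does not show how the ~79% was derived, so the small gap may simply reflect rounding of the averages.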

What To Try In 7 Days

Plug a bi-encoder (all-MiniLM-L6-v2) into your index and inject top-k chunks into prompts to check baseline gains.

Add a cross-encoder re-ranker on top of the bi-encoder for a small candidate set (k≈10) and measure faithfulness vs latency.

Run a 10–20 question test set of domain queries to compare faithfulness and spot chunking failures.
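A small harness makes the faithfulness-versus-latency comparison repeatable across the three setups. This is a sketch under stated assumptions: `evaluate`, `pipeline`, and `score_answer` are hypothetical names, and this summary does not say how the paper scores faithfulness, so the scorer here is a placeholder to swap for human grading or an automated judge.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def evaluate(
    pipeline: Callable[[str], str],
    questions: List[str],
    score_answer: Callable[[str, str], float],
) -> Dict[str, float]:
    """Run each question through a pipeline, recording a score and wall-clock latency."""
    scores: List[float] = []
    latencies_ms: List[float] = []
    for question in questions:
        start = time.perf_counter()
        answer = pipeline(question)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        scores.append(score_answer(question, answer))
    return {
        "avg_score": mean(scores),
        "avg_latency_ms": mean(latencies_ms),
        "n": float(len(questions)),
    }


# Hypothetical stand-ins: canned answers and canned grades instead of a real
# RAG pipeline and a real faithfulness scorer.
canned_answers = {"Q1": "grounded answer", "Q2": "unsupported answer"}
canned_scores = {"grounded answer": 1.0, "unsupported answer": 0.5}

report = evaluate(
    pipeline=lambda q: canned_answers[q],
    questions=["Q1", "Q2"],
    score_answer=lambda q, a: canned_scores[a],
)
```

Running the same question list through the Vanilla, Basic RAG, and Advanced RAG variants and comparing the reports side by side shows what the re-ranking step buys in score and what it costs in latency.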

Optimization Features

Token Efficiency

Inject only top 3 re-ranked chunks into prompt

System Optimization

Over-retrieve then filter to balance recall and precision

Inference Optimization

Limit cross-encoder to top-k candidates to reduce compute

Use bi-encoder for index-scale retrieval
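The token-efficiency item above (inject only the top 3 re-ranked chunks) amounts to truncating the ranked list before building the prompt. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, and the instruction wording is an assumption, not the prompt used in the paper.

```python
from typing import Sequence


def build_prompt(question: str, ranked_chunks: Sequence[str], max_chunks: int = 3) -> str:
    """Build a grounded prompt from the highest-ranked chunks only."""
    # Keep only the first `max_chunks` entries; the list is assumed to be
    # sorted best-first by the re-ranker.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(ranked_chunks[:max_chunks], start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What are the steps?",
    ["chunk one", "chunk two", "chunk three", "chunk four"],
)
```

Numbering the chunks makes it easier to ask the model to cite which passage supports each claim, though whether the paper does this is not stated in this summary.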

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Small evaluation: only a 10-question test set limits generality.

Chunking remains a bottleneck—fragmented documents hurt multi-step reasoning.

When Not To Use

When strict low-latency requirements preclude re-ranking compute.

If there is no curated, authoritative corpus to retrieve from.

Failure Modes

Bi-encoder retrieves semantically similar but contextually irrelevant chunks, causing hallucinations.

Chunk splitting can break logical policy steps and derail multi-step answers.
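The chunk-splitting failure mode is easy to reproduce on a toy example. The snippet below is an illustration under assumptions: the policy text is invented, `fixed_size_chunks` and `paragraph_chunks` are hypothetical helpers, and paragraph-aware splitting is shown only as one possible contrast, not as the chunking method evaluated in the paper.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into chunks of `size` words, ignoring document structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def paragraph_chunks(text: str) -> List[str]:
    """Split text on blank lines so each paragraph stays in one chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Invented policy text for illustration only.
policy = (
    "Step 1: Define the problem. Step 2: Compare the options.\n\n"
    "Eligibility: Applies to state programs only."
)

fixed = fixed_size_chunks(policy, size=6)
by_paragraph = paragraph_chunks(policy)
```

With a 6-word window, the first chunk ends on the word "Step" and the second begins with "2:", so no single chunk contains the label "Step 2" and the procedure is cut mid-sequence. Splitting on paragraphs keeps both steps together in one chunk, at the cost of uneven chunk sizes.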

Core Entities

Models

Mistral-7B-Instruct-v0.2

all-MiniLM-L6-v2 (bi-encoder)

ms-marco-MiniLM-L-6-v2 (cross-encoder)

Metrics

Faithfulness

Relevance

MRR@10 (cited concept)

Latency (ms)

Datasets

Custom CDC policy corpus (analytical frameworks and guidance)

Custom 10-question CDC policy QA set

Benchmarks

Custom 10-question evaluation (faithfulness, relevance)

Context Entities

Models

Sentence-BERT (cited background)

MS MARCO re-ranking literature (cited)

Metrics

MRR@10 (cited for re-ranking benefits)

Benchmarks

MS MARCO (cited for re-ranking gains)
