Cross-encoder re-ranking boosts faithfulness of RAG for CDC policy Q&A

January 21, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

Results are based on a focused CDC corpus and a small 10-question test; numbers show consistent trends but need larger, diverse evaluations for broader claims.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Anuj Maharjan, Umesh Yadav

Why It Matters For Business

Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.

Summary TLDR

This paper tests three setups for answering policy questions from CDC documents: a standalone LLM (Mistral-7B-Instruct, the Vanilla baseline), a Basic RAG (bi-encoder retrieval + context injection), and an Advanced RAG (bi-encoder followed by cross-encoder re-ranking). On a 10-question CDC test set, average faithfulness rises from 0.35 (Vanilla) to 0.62 (Basic RAG) and 0.80 (Advanced RAG). Relevance follows the same pattern (0.45 → 0.70 → 0.80). The main takeaway: a two-stage retrieval pipeline (over-retrieve with a fast bi-encoder, then re-rank candidates with a cross-encoder) materially reduces hallucinations, but document chunking remains a key bottleneck for multi-step policy reasoning.

Problem Statement

LLMs hallucinate when answering policy questions from long, structured government documents. The paper asks whether retrieval techniques—especially cross-encoder re-ranking and different chunking methods—can reliably ground answers in authoritative CDC documents and reduce hallucinations.

Main Contribution

Empirical comparison of Vanilla LLM, Basic RAG, and Advanced RAG on CDC policy Q&A.

A two-stage retrieval pipeline: fast bi-encoder over-retrieval plus cross-encoder re-ranking.
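The two-stage idea can be sketched without any model dependencies. The snippet below is a minimal illustration, not the authors' implementation: `two_stage_retrieve`, `toy_fast_score`, and `toy_rerank_score` are hypothetical names, and the toy scorers are simple token-overlap stand-ins. In a real setup the two callables would presumably wrap the bi-encoder (all-MiniLM-L6-v2) and the cross-encoder (ms-marco-MiniLM-L-6-v2) listed under Core Entities.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]  # (query, chunk) -> higher is better


def two_stage_retrieve(
    query: str,
    chunks: Sequence[str],
    fast_score: Scorer,
    rerank_score: Scorer,
    k_candidates: int = 10,
    k_final: int = 3,
) -> List[Tuple[str, float]]:
    """Over-retrieve with a cheap scorer, then re-rank only the survivors."""
    # Stage 1: cheap score over the whole corpus, keep the top k_candidates.
    candidates = sorted(
        chunks, key=lambda c: fast_score(query, c), reverse=True
    )[:k_candidates]
    # Stage 2: expensive score over the small candidate set only.
    reranked = sorted(
        ((c, rerank_score(query, c)) for c in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:k_final]


def _tokens(text: str) -> set:
    return set(text.lower().split())


def toy_fast_score(query: str, chunk: str) -> float:
    """Assumed stand-in for bi-encoder similarity: shared-token count."""
    return float(len(_tokens(query) & _tokens(chunk)))


def toy_rerank_score(query: str, chunk: str) -> float:
    """Assumed stand-in for a cross-encoder: Jaccard overlap."""
    q, c = _tokens(query), _tokens(chunk)
    return len(q & c) / len(q | c) if (q | c) else 0.0


query = "isolation period after positive test"
chunks = [
    # Shares every query word but is off-topic (a stage-1 false positive).
    "A positive attitude and a test of patience help after a long period "
    "of isolation from friends and family.",
    "The isolation period after a positive test is described in section two.",
    "Vaccination schedules are listed by age group.",
    "Hand washing guidance for schools.",
]

results = two_stage_retrieve(
    query, chunks, toy_fast_score, toy_rerank_score, k_candidates=2, k_final=1
)
print(results[0][0])
```

In this toy run both of the first two chunks tie in stage 1, and the second-stage scorer promotes the on-topic chunk. The split keeps the expensive scorer off the full corpus, which is the compute trade-off the paper highlights.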

Key Findings

Advanced RAG produced the highest grounding quality.

Numbers: Avg faithfulness: Vanilla 0.35 → Basic 0.62 → Advanced 0.80 (Table I)

Practical Use: Use a two-stage retrieval pipeline (bi-encoder + cross-encoder) to markedly reduce hallucinations on policy QA.

Evidence Ref: Table I, Fig. 2

Basic RAG reduces hallucinations versus a standalone LLM but is unstable on some queries.

Numbers: Basic RAG improved faithfulness by ~79% and relevance by ~55% over Vanilla (per paper text; the rounded Table I averages give about +77%, 0.62 vs 0.35, and +56%, 0.70 vs 0.45)

Practical Use: If compute for cross-encoders is limited, Basic RAG still helps, but expect volatility on edge or highly specific queries.

Evidence Ref: Section IV.C and Table I

Results

Metric: Avg faithfulness
Value: Vanilla 0.35 | Basic RAG 0.62 | Advanced RAG 0.80
Baseline: Vanilla LLM
Delta: Advanced vs Vanilla +129% relative (0.80 vs 0.35)
Split / Dataset: 10-question CDC policy QA set
Evidence: Table I shows per-question and average scores
Evidence Ref: Table I

Metric: Avg relevance
Value: Vanilla 0.45 | Basic RAG 0.70 | Advanced RAG 0.80
Baseline: Vanilla LLM
Delta: Advanced vs Vanilla +78% relative (0.80 vs 0.45)
Split / Dataset: 10-question CDC policy QA set
Evidence: Table I averages
Evidence Ref: Table I
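
The relative deltas follow directly from the reported averages. As a quick check, the short script below recomputes them; `relative_gain` is a hypothetical helper name, and the inputs are the rounded averages quoted above, so the results can differ by a point or two from figures computed on unrounded per-question scores.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Rounded averages as reported for the 10-question CDC set.
averages = {
    "faithfulness": {"Vanilla": 0.35, "Basic RAG": 0.62, "Advanced RAG": 0.80},
    "relevance": {"Vanilla": 0.45, "Basic RAG": 0.70, "Advanced RAG": 0.80},
}

for metric, scores in averages.items():
    baseline = scores["Vanilla"]
    for system in ("Basic RAG", "Advanced RAG"):
        gain = relative_gain(scores[system], baseline)
        print(f"{metric}: {system} vs Vanilla {gain:+.0f}%")

# faithfulness: Basic RAG vs Vanilla +77%
# faithfulness: Advanced RAG vs Vanilla +129%
# relevance: Basic RAG vs Vanilla +56%
# relevance: Advanced RAG vs Vanilla +78%
```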

What To Try In 7 Days

Plug a bi-encoder (all-MiniLM-L6-v2) into your index and inject top-k chunks into prompts to check baseline gains.

Add a cross-encoder re-ranker on top of the bi-encoder for a small candidate set (k≈10) and measure faithfulness vs latency.

Run a 10–20 question test set of domain queries to compare faithfulness and spot chunking failures.
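
A small harness makes the comparison in these steps repeatable. The sketch below is one possible shape, under stated assumptions: `evaluate`, the `systems` mapping, and `toy_judge` are hypothetical, and the judge is a placeholder because this summary does not describe how the paper scores faithfulness. Swap in your own scoring function (human labels or an automated grader).

```python
import time
from statistics import mean
from typing import Callable, Dict, List

AnswerFn = Callable[[str], str]        # question -> answer text
JudgeFn = Callable[[str, str], float]  # (question, answer) -> score in [0, 1]


def evaluate(
    systems: Dict[str, AnswerFn],
    questions: List[str],
    judge: JudgeFn,
) -> Dict[str, Dict[str, float]]:
    """Run every system on every question; report score and latency."""
    report: Dict[str, Dict[str, float]] = {}
    for name, answer in systems.items():
        scores: List[float] = []
        latencies_ms: List[float] = []
        for question in questions:
            start = time.perf_counter()
            reply = answer(question)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
            scores.append(judge(question, reply))
        report[name] = {
            "avg_score": mean(scores),
            "min_score": min(scores),  # low minimum flags volatile queries
            "avg_latency_ms": mean(latencies_ms),
        }
    return report


def toy_judge(question: str, answer: str) -> float:
    """Placeholder judge (assumption): rewards answers that cite a source."""
    return 1.0 if "[source]" in answer else 0.0


questions = ["How long is the isolation period?", "Who is eligible?"]
systems = {
    "vanilla": lambda q: "Probably a few days.",
    "basic_rag": lambda q: "Five days [source].",
}
report = evaluate(systems, questions, toy_judge)
for name, stats in report.items():
    print(name, stats["avg_score"], stats["min_score"])
```

Tracking the minimum score alongside the average helps surface the per-query volatility noted for Basic RAG, and the latency column gives the faithfulness-versus-latency view when a re-ranker is added.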

Optimization Features

Token Efficiency
Inject only top 3 re-ranked chunks into prompt
System Optimization
Over-retrieve then filter to balance recall and precision
Inference Optimization
Limit cross-encoder to top-k candidates to reduce compute
Use bi-encoder for index-scale retrieval
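
The token-efficiency point (inject only the top 3 re-ranked chunks) amounts to a small prompt-assembly step. The sketch below shows one way to do it; `build_prompt` and the instruction wording are assumptions for illustration, not the prompt used in the paper.

```python
from typing import Sequence


def build_prompt(
    question: str,
    ranked_chunks: Sequence[str],
    max_chunks: int = 3,
) -> str:
    """Build a grounded-answer prompt from the best-ranked chunks only."""
    # ranked_chunks is assumed to be sorted best-first by the re-ranker.
    selected = list(ranked_chunks[:max_chunks])
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How long is the isolation period?",
    ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"],
)
print(prompt)
```

Capping the context at a few chunks keeps prompt length bounded regardless of how many candidates the first stage over-retrieves.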

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Small evaluation: only a 10-question test set limits generality.

Chunking remains a bottleneck—fragmented documents hurt multi-step reasoning.

When Not To Use

When strict low-latency requirements preclude re-ranking compute.

If there is no curated, authoritative corpus to retrieve from.

Failure Modes

Bi-encoder retrieves semantically similar but contextually irrelevant chunks, causing hallucinations.

Chunk splitting can break logical policy steps and derail multi-step answers.
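
One common mitigation for the chunk-splitting failure is to chunk on paragraph boundaries and overlap adjacent chunks, so a policy step is never cut mid-paragraph and consecutive steps share context. The sketch below illustrates that idea only; it is not the chunking method evaluated in the paper, and `chunk_paragraphs`, `max_chars`, and `overlap_paragraphs` are hypothetical names and defaults.

```python
from typing import List


def chunk_paragraphs(
    text: str,
    max_chars: int = 800,
    overlap_paragraphs: int = 1,
) -> List[str]:
    """Pack whole paragraphs into chunks, repeating the tail as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        would_be = "\n\n".join(current + [para])
        if current and len(would_be) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the last paragraph(s) forward so steps stay connected.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        # Paragraphs are never split, so a chunk may exceed max_chars.
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


policy_text = (
    "Step 1: Identify the exposure.\n\n"
    "Step 2: Assess the risk level.\n\n"
    "Step 3: Apply the recommended action."
)
policy_chunks = chunk_paragraphs(policy_text, max_chars=70)
for chunk in policy_chunks:
    print(repr(chunk))
```

Here the three steps land in two chunks, with Step 2 repeated in both, so a query about Step 3 still retrieves the step that precedes it. Overlap increases index size, so the amount is worth tuning against retrieval quality.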

Core Entities

Models

Mistral-7B-Instruct-v0.2
all-MiniLM-L6-v2 (bi-encoder)
ms-marco-MiniLM-L-6-v2 (cross-encoder)

Metrics

Faithfulness
Relevance
MRR@10 (cited concept)
Latency (ms)

Datasets

Custom CDC policy corpus (analytical frameworks and guidance)
Custom 10-question CDC policy QA set

Benchmarks

Custom 10-question evaluation (faithfulness, relevance)

Context Entities

Models

Sentence-BERT (cited background)
MS MARCO re-ranking literature (cited)

Metrics

MRR@10 (cited for re-ranking benefits)

Benchmarks

MS MARCO (cited for re-ranking gains)