Cross-encoder re-ranking boosts faithfulness of RAG for CDC policy Q&A

January 21, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

2

Authors

Anuj Maharjan, Umesh Yadav

Why It Matters For Business

Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.

Summary TLDR

This paper tests three setups for answering policy questions from CDC documents: a standalone LLM (Mistral-7B-Instruct, referred to as Vanilla), Basic RAG (bi-encoder retrieval + context injection), and Advanced RAG (bi-encoder retrieval followed by cross-encoder re-ranking). On a 10-question CDC test set, average faithfulness rises from 0.35 (Vanilla) to 0.62 (Basic RAG) and 0.80 (Advanced RAG). Relevance follows the same pattern (0.45 → 0.70 → 0.80). The main takeaway: a two-stage retrieval pipeline (over-retrieve with a fast bi-encoder, then re-rank candidates with a cross-encoder) materially reduces hallucinations, but document chunking remains a key bottleneck for multi-step policy reasoning.

Problem Statement

LLMs hallucinate when answering policy questions from long, structured government documents. The paper asks whether retrieval techniques, especially cross-encoder re-ranking and different chunking methods, can reliably ground answers in authoritative CDC documents and reduce hallucinations.

Main Contribution

Empirical comparison of Vanilla LLM, Basic RAG, and Advanced RAG on CDC policy Q&A.

A two-stage retrieval pipeline: fast bi-encoder over-retrieval plus cross-encoder re-ranking.

Evaluation of two chunking approaches (recursive character split and token-semantic split) and their impact on faithfulness.

Quantified gains: Basic RAG improves faithfulness and relevance over Vanilla; Advanced RAG improves further with re-ranking.

Qualitative examples showing how re-ranking recovers context-critical answers and reduces hallucinations.
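The recursive character split named above can be illustrated with a minimal, standard-library sketch. This is not the paper's implementation: the function name `recursive_split`, the separator order, and the `max_chars` parameter are assumptions made for illustration, and separators that fall on a chunk boundary are simply dropped.

```python
from typing import List, Sequence


def recursive_split(text: str, max_chars: int,
                    separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split text into chunks of at most max_chars, trying coarse separators first."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], tuple(separators[1:])
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, max_chars, finer)

    chunks: List[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(part) <= max_chars:
            current = part
        else:
            # A single part is still too long: recurse with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks


# Hypothetical policy text, not taken from the CDC corpus.
policy = ("Step 1: Define the problem.\n\n"
          "Step 2: Identify policy options.\n\n"
          "Step 3: Assess each option against the criteria.")
chunks = recursive_split(policy, max_chars=40)
```

With `max_chars=40`, the third step no longer fits in one chunk and is cut in two, a small-scale version of the fragmentation that the paper identifies as a bottleneck for multi-step answers.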

Key Findings

Advanced RAG produced the highest grounding quality.

Numbers: Avg faithfulness: Vanilla 0.35 → Basic 0.62 → Advanced 0.80 (Table I)

Basic RAG reduces hallucinations versus a standalone LLM but is unstable on some queries.

Numbers: Basic RAG improved faithfulness by ~79% and relevance by ~55% over Vanilla (per paper text)

Re-ranking trades compute for precision; cross-encoders increase relevance accuracy.

Numbers: Bi-encoder relevance 65–80% vs cross-encoder 85–90%; latency ~15 ms per 1M docs vs 50–150 ms per 20 docs (Table III)
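The relative gains quoted in these findings can be rechecked from the reported averages. The sketch below uses only the rounded Table I values listed here; the helper name `relative_gain` is an assumption. Computed this way, the faithfulness gain of Basic RAG over Vanilla comes out near 77%, slightly below the ~79% quoted from the paper text, possibly because the averages shown here are rounded.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Rounded averages as reported in this summary (Table I).
faithfulness = {"vanilla": 0.35, "basic": 0.62, "advanced": 0.80}
relevance = {"vanilla": 0.45, "basic": 0.70, "advanced": 0.80}

basic_faithfulness_gain = relative_gain(faithfulness["vanilla"], faithfulness["basic"])
basic_relevance_gain = relative_gain(relevance["vanilla"], relevance["basic"])
advanced_over_basic = relative_gain(faithfulness["basic"], faithfulness["advanced"])

print(f"Basic vs Vanilla faithfulness: {basic_faithfulness_gain:.1f}%")
print(f"Basic vs Vanilla relevance:    {basic_relevance_gain:.1f}%")
print(f"Advanced vs Basic faithfulness: {advanced_over_basic:.1f}%")
```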

Results

Avg faithfulness

Value: Vanilla 0.35 | Basic RAG 0.62 | Advanced RAG 0.80

Baseline: Vanilla LLM

Avg relevance

Value: Vanilla 0.45 | Basic RAG 0.70 | Advanced RAG 0.80

Baseline: Vanilla LLM

Accuracy

Value: Bi-encoder 65–80% | Cross-encoder 85–90%

Baseline: Bi-encoder

Latency tradeoff

Value: Bi-encoder ~15 ms per 1M docs | Cross-encoder 50–150 ms per 20 docs

Baseline: Bi-encoder
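The latency figures above can be turned into a rough per-query budget for the two-stage pipeline. This is a back-of-the-envelope sketch, not a measurement: it assumes cross-encoder cost scales linearly with the number of candidates, which the source does not state, and the function name `two_stage_latency_ms` is hypothetical.

```python
from typing import Tuple

# Figures as reported in this summary (Table III).
BI_ENCODER_MS = 15.0                      # ~15 ms to search 1M docs
CROSS_ENCODER_MS_PER_20 = (50.0, 150.0)   # 50-150 ms to score 20 docs


def two_stage_latency_ms(num_candidates: int) -> Tuple[float, float]:
    """Estimated (low, high) retrieval latency in ms for one query.

    Assumption: cross-encoder time grows linearly with the candidate count.
    """
    scale = num_candidates / 20
    low, high = CROSS_ENCODER_MS_PER_20
    return (BI_ENCODER_MS + low * scale, BI_ENCODER_MS + high * scale)


for k in (10, 20):
    low, high = two_stage_latency_ms(k)
    print(f"{k} candidates: {low:.0f}-{high:.0f} ms")
```

Under that linearity assumption, halving the candidate set from 20 to 10 roughly halves the re-ranking share of the budget, which is the lever behind limiting the cross-encoder to a small top-k.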

Who Should Care

What To Try In 7 Days

Plug a bi-encoder (all-MiniLM-L6-v2) into your index and inject top-k chunks into prompts to check baseline gains.

Add a cross-encoder re-ranker on top of the bi-encoder for a small candidate set (k≈10) and measure faithfulness vs latency.

Run a 10–20 question test set of domain queries to compare faithfulness and spot chunking failures.
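The first two steps above follow an over-retrieve-then-re-rank pattern, sketched below with the standard library only. The two scorers are toy stand-ins, not the paper's models: `fast_score` (bag-of-words cosine) takes the place of a bi-encoder such as all-MiniLM-L6-v2, and `precise_score` (shared bigram fraction) takes the place of a cross-encoder such as ms-marco-MiniLM-L-6-v2. All function names, parameters, and the sample corpus are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def fast_score(query: str, doc: str) -> float:
    """Stand-in for a bi-encoder: cosine similarity of token counts."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def precise_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query bigrams found in the doc."""
    q_tok, d_tok = tokenize(query), tokenize(doc)
    q_bigrams = set(zip(q_tok, q_tok[1:]))
    d_bigrams = set(zip(d_tok, d_tok[1:]))
    return len(q_bigrams & d_bigrams) / len(q_bigrams) if q_bigrams else 0.0


def two_stage_retrieve(query: str, corpus: Sequence[str],
                       n_candidates: int = 10, top_k: int = 3,
                       first_stage: Scorer = fast_score,
                       second_stage: Scorer = precise_score) -> List[Tuple[str, float]]:
    # Stage 1: cheap scoring over the whole corpus, keep a generous candidate set.
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc),
                        reverse=True)[:n_candidates]
    # Stage 2: costlier scoring on the candidates only, keep the best top_k.
    reranked = sorted(((doc, second_stage(query, doc)) for doc in candidates),
                      key=lambda pair: pair[1], reverse=True)
    return reranked[:top_k]


# Hypothetical chunks, not taken from the CDC corpus.
corpus = [
    "Policy analysis compares options by health impact, feasibility, and economic cost.",
    "Wash hands with soap for at least 20 seconds.",
    "Economic cost and budget impact are assessed for each policy option.",
    "The framework has steps: problem identification, policy analysis, strategy development.",
    "Vaccination schedules are updated annually.",
]
query = "How is the economic cost of a policy option assessed?"
results = two_stage_retrieve(query, corpus, n_candidates=3, top_k=2)
```

Passing the scorers as parameters keeps the pipeline shape separate from the models, so real bi-encoder and cross-encoder calls can be swapped in while the same test set is used to compare faithfulness and latency.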

Optimization Features

Token Efficiency

  • Inject only top 3 re-ranked chunks into prompt
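A minimal sketch of that context injection step, assuming the chunks arrive already sorted best-first from the re-ranker. The prompt wording and the `build_prompt` helper are hypothetical; the source does not give the paper's prompt template.

```python
from typing import Sequence


def build_prompt(question: str, ranked_chunks: Sequence[str], max_chunks: int = 3) -> str:
    """Build a grounded prompt from at most max_chunks re-ranked chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}"
        for i, chunk in enumerate(ranked_chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunks; only the first three reach the prompt.
prompt = build_prompt(
    "Which criteria are used to compare policy options?",
    ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"],
)
```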

System Optimization

  • Over-retrieve then filter to balance recall and precision

Inference Optimization

  • Limit cross-encoder to top-k candidates to reduce compute
  • Use bi-encoder for index-scale retrieval

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Small evaluation: a 10-question test set limits how far the results generalize.
  • Chunking remains a bottleneck: fragmented documents hurt multi-step reasoning.
  • Results are domain-specific to CDC policy documents and may not generalize.
  • Cross-encoder re-ranking increases compute and latency, raising deployment cost.
  • Data and code are not provided, limiting reproducibility.

When Not To Use

  • When strict low-latency requirements preclude re-ranking compute.
  • If there is no curated, authoritative corpus to retrieve from.
  • When you cannot control or secure the retrieval pipeline for sensitive data.

Failure Modes

  • Bi-encoder retrieves semantically similar but contextually irrelevant chunks, causing hallucinations.
  • Chunk splitting can break logical policy steps and derail multi-step answers.
  • Basic RAG shows volatile performance on edge cases without re-ranking.

Core Entities

Models

  • Mistral-7B-Instruct-v0.2
  • all-MiniLM-L6-v2 (bi-encoder)
  • ms-marco-MiniLM-L-6-v2 (cross-encoder)

Metrics

  • Faithfulness
  • Relevance
  • MRR@10 (cited concept)
  • Latency (ms)
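MRR@10 is listed here as a cited concept rather than a metric reported on the paper's test set. For reference, a minimal sketch of the standard definition: the mean over queries of the reciprocal rank of the first relevant result within the top 10, counting 0 when none appears. The function name and the boolean-flag input format are assumptions.

```python
from typing import Sequence


def mrr_at_k(ranked_relevance: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant result within the top k.

    ranked_relevance[q][i] is True if the result at rank i + 1 for query q is relevant.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)


# Three hypothetical queries: first hit at rank 2, at rank 1, and none in the top 10.
score = mrr_at_k([[False, True], [True], [False] * 12])
```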

Datasets

  • Custom CDC policy corpus (analytical frameworks and guidance)
  • Custom 10-question CDC policy QA set

Benchmarks

  • Custom 10-question evaluation (faithfulness, relevance)

Context Entities

Models

  • Sentence-BERT (cited background)
  • MS MARCO re-ranking literature (cited)

Metrics

  • MRR@10 (cited for re-ranking benefits)

Benchmarks

  • MS MARCO (cited for re-ranking gains)