Add explicit, verifiable rationales and reranking to RAG to cut hallucinations in biomedical QA

March 10, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Eeham Khan, Luis Rodriguez, Marc Queudot

Why It Matters For Business

Evidence-linked rationales plus reranking let smaller, cheaper LLMs reach competitive accuracy on literature QA, reducing model cost while improving explainability.

Summary TLDR

This paper presents a domain-focused retrieval-augmented generation (RAG) workflow that forces the model to produce evidence-linked rationales and then verifies each sub-claim against retrieved passages. Key modules: BM25 retrieval, BGE cross-encoder reranking, optional GPT-4o query rewriting, Llama-3-8B rationale generation, and GPT-4o-based statement-level verification. On biomedical QA (BioASQ, PubMedQA) the approach raises accuracy vs. vanilla RAG and yields competitive results compared to much larger models, with explicit failure categories to help diagnose retrieval vs. reasoning errors.
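
The rationale step described above can be illustrated with a minimal sketch. The prompt wording, the `[P1]`-style ID format, and the function names below are assumptions made for illustration; the paper's actual prompts are not reproduced here. The sketch only shows how passage IDs can be exposed to the model and how cited IDs can be checked against the evidence list before any statement-level verification runs.

```python
import re

def build_rationale_prompt(question, passages):
    """Format a prompt asking for a rationale that cites passage IDs like [P1].

    The wording is a hypothetical example, not the prompt used in the paper.
    """
    evidence = "\n".join(f"[P{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using only the evidence below.\n"
        "Write a step-by-step rationale and cite a passage ID such as [P1] "
        "after every sub-claim, then give the final answer.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nRationale:"
    )

def cited_ids(rationale):
    """Return the set of passage numbers cited in the rationale."""
    return {int(n) for n in re.findall(r"\[P(\d+)\]", rationale)}

def invalid_citations(rationale, num_passages):
    """Return cited passage numbers that do not exist in the evidence list."""
    return sorted(n for n in cited_ids(rationale) if not 1 <= n <= num_passages)

if __name__ == "__main__":
    passages = ["Aspirin inhibits COX-1.", "Ibuprofen is an NSAID."]
    print(build_rationale_prompt("Does aspirin inhibit COX-1?", passages))
    print(invalid_citations("Aspirin inhibits COX-1 [P1]. See also [P3].", len(passages)))
```

A citation to a passage that was never retrieved is a cheap signal to flag before spending a verifier call on the statement.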

Problem Statement

Standard RAG pipelines reduce hallucination but often lack explicit, verifiable reasoning steps. This leaves high-stakes domains vulnerable: retrieved evidence can be misused or misinterpreted, and errors are hard to attribute to retrieval or to reasoning.

Main Contribution

A modular, reproducible biomedical RAG blueprint that adds rationale generation and statement-level verification.

An eight-category taxonomy for checking whether each rationale statement is supported, contradicted, irrelevant, or missing evidence.

Controlled experiments isolating reranking and dynamic in-context demonstration selection under token/latency limits, with a public demo.

Key Findings

Generating explicit evidence-linked rationales improves accuracy over vanilla RAG on evaluated biomedical QA.

Numbers: BioASQ, vanilla RAG 82.3% → +Rationale 85.8% (Table 2)

Reranking retrieved passages with a BGE cross-encoder can yield large gains in few-shot settings.

Numbers: PubMedQA 4-shot, w/o rerank 47.5% → with rerank 60.0% (+12.5 pts) (Table 3)

Dynamic, similarity-based selection of in-context demonstrations consistently outperforms static examples.

Numbers: BioASQ 4-shot, static 71.7% → dynamic 86.2% (+14.5 pts) (Table 3)

A compact Llama-3-8B-Instruct with rationale generation and reranking matches or approaches much larger model systems on the tested benchmarks.

Numbers: BioASQ best 89.1% (3-shot dynamic ICL + rerank) vs. MedRAG+GPT-3.5 90.29% (Table 2)

Automated LLM-based verification tends to be more permissive than human annotators on faithfulness scores.

Numbers: pilot on 4 examples, LLM verifier mean 0.94 vs. human means 0.85 and 0.65 (Appendix A)

Results

  • Accuracy: 89.1% (baseline: MedRAG+GPT-3.5, 90.29%)
  • Accuracy: 73.0% (baseline: MedRAG+GPT-4, 70.60%)
  • Accuracy: +12.5 pts (baseline: no rerank)
  • Dynamic vs. static demonstrations: +14.5 pts (baseline: static selection)

What To Try In 7 Days

Add a rationale-generation prompt that asks the model to cite passage IDs for each sub-claim.

Plug a lightweight reranker (BGE-style cross-encoder) after BM25 and feed top-5 passages to the generator.

Create a small pool of model-generated demonstrations and implement embedding-based nearest-neighbor selection for dynamic in-context examples.
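
Dynamic demonstration selection, the last step above, can be prototyped in a few lines. The sketch below is a rough illustration, not the paper's implementation: `embed` is a toy bag-of-words stand-in for a real sentence encoder, and the pool format (dicts with `question` and `answer` keys) is an assumption.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_demonstrations(query, pool, k=4):
    """Return the k pool entries whose questions are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d["question"])), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    pool = [
        {"question": "Does metformin reduce blood glucose?", "answer": "yes"},
        {"question": "Is aspirin effective for headache?", "answer": "yes"},
        {"question": "Does smoking cause lung cancer?", "answer": "yes"},
    ]
    for demo in select_demonstrations("Does metformin lower blood glucose levels?", pool, k=2):
        print(demo["question"])
```

Swapping `embed` for a dense encoder changes only that one function; the nearest-neighbor selection stays the same. In production the pool embeddings would normally be computed once and cached rather than recomputed per query.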

Optimization Features

Token Efficiency

  • Dynamic demonstration selection to avoid context-window competition between evidence and examples

System Optimization

  • Optional query rewriting triggered when lexical overlap is low
  • Deterministic segmentation of rationales for verification
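
Both system-level features above can be approximated with simple, deterministic text functions. The sketch below is illustrative only: the overlap definition, the 0.3 threshold, and the punctuation-based splitter are assumptions, since the summary does not state how the paper measures overlap or segments rationales.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_overlap(query, passage):
    """Fraction of query terms that also appear in the passage."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0

def needs_rewrite(query, passages, threshold=0.3):
    """Trigger rewriting when even the best passage shares few terms with the query.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    best = max((lexical_overlap(query, p) for p in passages), default=0.0)
    return best < threshold

def segment_rationale(rationale):
    """Split a rationale into statements at sentence-ending punctuation.

    Deterministic but naive: abbreviations such as "e.g." would also split.
    """
    parts = re.split(r"(?<=[.!?])\s+", rationale.strip())
    return [p for p in parts if p]

if __name__ == "__main__":
    passages = ["Myocardial infarction therapy includes aspirin."]
    print(needs_rewrite("heart attack treatment", passages))
    print(segment_rationale("Aspirin inhibits COX-1 [P1]. So the answer is yes."))
```

Because segmentation uses no model call, the same rationale always yields the same statements, which keeps verification results comparable across runs.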

Inference Optimization

  • Use BGE reranker to reduce noisy context
  • Limit evidence to top-5 passages to save tokens
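
The retrieve, rerank, and truncate flow above can be sketched without external dependencies. This is a minimal illustration under stated assumptions: BM25 is implemented inline with common default parameters (`k1=1.5`, `b=0.75`), `n_candidates=20` is an arbitrary shortlist size, and `rerank_score` is a hypothetical callable standing in for a cross-encoder such as the BGE reranker.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))  # document frequency
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve_and_rerank(query, docs, rerank_score, n_candidates=20, top_k=5):
    """Shortlist with BM25, reorder with a pairwise scorer, keep top_k passages."""
    scores = bm25_scores(query, docs)
    shortlist = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:n_candidates]
    reranked = sorted(shortlist, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_k]]

def toy_rerank(query, doc):
    """Placeholder scorer: fraction of query terms found in the document."""
    q = set(tokenize(query))
    return len(q & set(tokenize(doc))) / len(q) if q else 0.0
```

Keeping the reranker behind a plain `(query, passage) -> float` callable means the placeholder can be replaced by a real cross-encoder without touching the retrieval code.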

Reproducibility

License

  • CC BY 4.0 (paper text); code not provided

Data URLs

  • MedRAG toolkit (PubMed abstracts) referenced in paper
  • Demo: https://huggingface.co/spaces/DialogueRobust/RobustDialogueDemo

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to two English biomedical datasets; may not generalize to other domains or languages.
  • Parts of the pipeline use OpenAI APIs, adding latency and cost for deployment.
  • Human evaluation is very small (4 examples); automated verifier appears more permissive than humans.
  • Single-run results without statistical significance testing; effect sizes need replication.

When Not To Use

  • As a clinical decision system without extensive clinical validation and clinician-in-the-loop testing.
  • In domains where a high-quality retriever or corpus is not available.
  • If real-time, low-latency constraints prohibit the external API calls used here.

Failure Modes

  • Reranker may still surface irrelevant passages, causing incorrect grounding.
  • LLM verifier can be more permissive than humans and miss subtle unsupported inferences.
  • Label-prior bias when dynamic demonstrations are not class-balanced (noted for ternary PubMedQA).
  • Context-window tradeoffs: too many demonstrations compete with retrieved evidence.

Core Entities

Models

  • Llama-3-8B-Instruct
  • GPT-4o
  • GPT-4
  • GPT-3.5
  • BGE cross-encoder (BGE-v2-m3)

Metrics

  • Accuracy
  • Faithfulness score (proportion of supported statements)
  • Cohen's kappa
  • Per-category F1
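
Two of the metrics above are simple enough to compute by hand, which helps when sanity-checking a verifier against human annotators. The sketch below follows the definition given in the list (faithfulness as the proportion of supported statements) and the standard formula for Cohen's kappa. The label strings and the example annotations are made up for illustration and are not data from the paper.

```python
from collections import Counter

def faithfulness(labels, supported="supported"):
    """Proportion of rationale statements labeled as supported."""
    return sum(label == supported for label in labels) / len(labels) if labels else 0.0

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Hypothetical labels for four statements from one rationale.
    verifier = ["supported", "supported", "contradicted", "supported"]
    human = ["supported", "contradicted", "contradicted", "supported"]
    print(faithfulness(verifier))          # 0.75
    print(faithfulness(human))             # 0.5
    print(cohens_kappa(verifier, human))   # 0.5
```

In this made-up example the verifier's faithfulness score is higher than the human's even though the two agree on three of four statements, the same direction of disagreement the pilot study reports.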

Datasets

  • BioASQ
  • PubMedQA
  • PubMed abstracts
  • MIRAGE benchmark

Benchmarks

  • BioASQ
  • PubMedQA
  • MIRAGE