Add explicit, verifiable rationales and reranking to RAG to cut hallucinations in biomedical QA

March 10, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides controlled ablations and clear metrics, but the human evaluation is very small and no statistical significance testing is reported; treat the results as promising, prototype-level evidence.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

License: CC BY 4.0 (paper text); code not provided

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Eeham Khan, Luis Rodriguez, Marc Queudot

Why It Matters For Business

Evidence-linked rationales plus reranking let smaller, cheaper LLMs reach competitive accuracy on literature QA, reducing model cost while improving explainability.

Summary TLDR

This paper presents a domain-focused retrieval-augmented generation (RAG) workflow that forces the model to produce evidence-linked rationales and then verifies each sub-claim against retrieved passages. Key modules: BM25 retrieval, BGE cross-encoder reranking, optional GPT-4o query rewriting, Llama-3-8B rationale generation, and GPT-4o-based statement-level verification. On biomedical QA (BioASQ, PubMedQA) the approach raises accuracy vs. vanilla RAG and yields competitive results compared to much larger models, with explicit failure categories to help diagnose retrieval vs. reasoning errors.
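The first module in that list, BM25 retrieval, can be illustrated with a small standard-library sketch. This is a generic Okapi BM25 scorer, not the paper's retriever; the whitespace tokenizer, the `k1` and `b` defaults, and the toy passages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would use a proper analyzer."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages for illustration only; they are not taken from the paper.
docs = [
    "aspirin reduces the risk of myocardial infarction",
    "statins lower cholesterol levels",
    "aspirin may cause gastrointestinal bleeding",
]
scores = bm25_scores("aspirin myocardial infarction", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `ranked` holds document indices from best to worst match; the top candidates would then be handed to the reranking stage.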

Problem Statement

Standard RAG pipelines reduce hallucination but often lack explicit, verifiable reasoning steps; this leaves high-stakes domains vulnerable because retrieved evidence can be misused or misinterpreted and errors are hard to attribute.

Main Contribution

A modular, reproducible biomedical RAG blueprint that adds rationale generation and statement-level verification.

An eight-category taxonomy for checking whether each rationale statement is supported, contradicted, irrelevant, or missing evidence.

Key Findings

Generating explicit evidence-linked rationales improves accuracy over vanilla RAG on evaluated biomedical QA.

Numbers: BioASQ: Vanilla RAG 82.3% → +Rationale 85.8% (Table 2)

Practical Use: Add a rationale-generation step to RAG pipelines to reduce hallucinations and boost accuracy on literature-grounded QA.

Evidence Ref: Table 2
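A rationale-generation step of this kind might look like the sketch below. The prompt wording, the `P1`-style passage IDs, and the helper names `build_rationale_prompt` and `cited_ids` are hypothetical; the paper's actual prompt is not reproduced here.

```python
import re


def build_rationale_prompt(question, passages):
    """Format a prompt that asks for one cited statement per line.

    `passages` maps a passage ID (e.g. "P1") to its text. The wording
    is illustrative only and is not the paper's prompt.
    """
    evidence = "\n".join(f"[{pid}] {text}" for pid, text in passages.items())
    return (
        "Answer the question using only the evidence below.\n"
        "Write your reasoning as short statements, one per line, and end "
        "each statement with the ID of the passage that supports it, "
        "e.g. [P1].\n"
        "Finish with a line of the form 'Answer: yes' or 'Answer: no'.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n"
    )


def cited_ids(statement):
    """Extract the passage IDs cited in one rationale statement."""
    return re.findall(r"\[(P\d+)\]", statement)


# Placeholder passages; a real pipeline would pass the reranked evidence.
passages = {
    "P1": "Toy passage text about drug A.",
    "P2": "Toy passage text about outcome B.",
}
prompt = build_rationale_prompt("Does drug A affect outcome B?", passages)
```

Parsing the cited IDs back out of each statement is what lets a later verification step check every claim against the specific passage it points to.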

Reranking retrieved passages with a BGE cross-encoder can yield large gains in few-shot settings.

Numbers: PubMedQA 4-shot: w/o rerank 47.5% → with rerank 60.0% (+12.5 pts) (Table 3)

Practical Use: Use a cross-encoder reranker when few demonstrations are used; it filters noisy passages that mislead the reasoner.

Evidence Ref: Table 3
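The rerank-then-truncate step can be sketched with a pluggable scoring function. The `overlap_score` function below is only a stand-in so the example runs without a model; in a real pipeline it would be replaced by a cross-encoder's relevance score for each (query, passage) pair. All names here are assumptions.

```python
def rerank(query, passages, score_fn, top_k=5):
    """Sort candidate passages by score_fn(query, passage) and keep the top_k."""
    scored = [(score_fn(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]


def overlap_score(query, passage):
    """Stand-in scorer: fraction of query tokens present in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# Toy candidates, e.g. the output of a first-stage BM25 retriever.
candidates = [
    "statins lower cholesterol",
    "aspirin and bleeding risk in older adults",
    "weather patterns in coastal regions",
]
top = rerank("aspirin bleeding risk", candidates, overlap_score, top_k=2)
```

Keeping the scorer as a parameter makes it easy to compare accuracy with and without the cross-encoder, which mirrors the ablation reported in Table 3.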

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 89.1% | MedRAG+GPT-3.5: 90.29% | -1.19 pts | BioASQ-Y/N (best: 3-shot Dynamic ICL + rerank) | Table 2 reports 89.1% for best config on BioASQ | Table 2
Accuracy | 73.0% | MedRAG+GPT-4: 70.60% | +2.4 pts | PubMedQA (0-shot rationale generation) | Table 2 shows 73.0% for 0-shot rationale gen outperforming MedRAG+GPT-4 | Table 2

What To Try In 7 Days

Add a rationale-generation prompt that asks the model to cite passage IDs for each sub-claim.

Plug a lightweight reranker (BGE-style cross-encoder) after BM25 and feed top-5 passages to the generator.

Create a small pool of model-generated demonstrations and implement embedding-based nearest-neighbor selection for dynamic in-context examples.
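The third item, embedding-based nearest-neighbor selection of demonstrations, could be prototyped as follows. This is a minimal sketch assuming embeddings are already computed; the three-dimensional vectors and demonstration texts are made up, and a real system would obtain embeddings from a sentence-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_demonstrations(query_vec, pool, k=3):
    """Return the k demonstrations whose embeddings are closest to the query.

    `pool` is a list of (embedding, demonstration_text) pairs.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [demo for _, demo in ranked[:k]]


# Hypothetical pool with hand-written 3-d embeddings.
pool = [
    ([1.0, 0.0, 0.0], "demo about drug efficacy"),
    ([0.0, 1.0, 0.0], "demo about gene expression"),
    ([0.9, 0.1, 0.0], "demo about drug safety"),
]
chosen = select_demonstrations([1.0, 0.05, 0.0], pool, k=2)
```

With a small pool, a linear scan like this is usually sufficient; an approximate nearest-neighbor index only becomes worthwhile once the pool grows large.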

Optimization Features

Token Efficiency

Dynamic demonstration selection to avoid context-window competition between evidence and examples

System Optimization

Optional query rewriting triggered when lexical overlap is low

Deterministic segmentation of rationales for verification

Inference Optimization

Use BGE reranker to reduce noisy context

Limit evidence to top-5 passages to save tokens
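Two of the system-level items above, the overlap-triggered query rewrite and deterministic rationale segmentation, can be sketched with plain string handling. The overlap definition, the `0.5` threshold, and the sentence-splitting rule are assumptions; the summary does not state which ones the paper uses.

```python
import re


def lexical_overlap(query, passages):
    """Fraction of query tokens that appear in at least one passage."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    if not q_tokens:
        return 0.0
    seen = set()
    for p in passages:
        seen |= set(re.findall(r"\w+", p.lower()))
    return len(q_tokens & seen) / len(q_tokens)


def needs_rewrite(query, passages, threshold=0.5):
    """Trigger rewriting when overlap falls below a hypothetical threshold."""
    return lexical_overlap(query, passages) < threshold


def segment_rationale(rationale):
    """Split a rationale into statements at sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", rationale.strip())
    return [p for p in parts if p]


retrieved = ["aspirin reduces myocardial infarction risk"]
rewrite_lay = needs_rewrite("heart attack prevention", retrieved)
rewrite_tech = needs_rewrite("aspirin infarction risk", retrieved)
statements = segment_rationale(
    "Aspirin inhibits platelets [P1]. This lowers clot risk [P2]. Answer: yes"
)
```

Because the splitter is rule-based, the same rationale always yields the same statements, which keeps verification results comparable across runs. A rule this simple would also split on abbreviations such as "e.g.", so a production version would need a more careful segmenter.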

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: CC BY 4.0 (paper text); code not provided

Data URLs

MedRAG toolkit (PubMed abstracts) referenced in paper

Demo: https://huggingface.co/spaces/DialogueRobust/RobustDialogueDemo

Risks & Boundaries

Limitations

Evaluation limited to two English biomedical datasets; may not generalize to other domains or languages.

Parts of pipeline use OpenAI APIs, adding latency and cost for deployment.

When Not To Use

As a clinical decision system without extensive clinical validation and clinician-in-the-loop testing.

In domains where a high-quality retriever or corpus is not available.

Failure Modes

Reranker may still surface irrelevant passages, causing incorrect grounding.

LLM verifier can be more permissive than humans and miss subtle unsupported inferences.
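One way to monitor the verifier-permissiveness risk is to have a human label a sample of statements and compute agreement with the LLM verifier using Cohen's kappa, which the paper lists among its metrics. The sketch below is a generic implementation; the two-label scheme and the sample labels are invented for illustration and are not the paper's categories or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: the verifier accepts one statement the human rejects.
human = ["supported", "supported", "unsupported", "unsupported"]
verifier = ["supported", "supported", "supported", "unsupported"]
kappa = cohens_kappa(human, verifier)  # observed 0.75, expected 0.5
```

In this toy sample the observed agreement is 0.75 and the chance agreement is 0.5, giving a kappa of 0.5. A kappa that drops over time on fresh samples would suggest the verifier is drifting away from human judgment.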

Core Entities

Models

Llama-3-8B-Instruct, GPT-4o, GPT-4, GPT-3.5, BGE cross-encoder (BGE-v2-m3)

Metrics

Accuracy, Faithfulness score (proportion of supported statements), Cohen's kappa, Per-category F1
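Two of these metrics are simple enough to sketch directly. The faithfulness score follows the definition given above (proportion of supported statements); the per-category F1 is the standard one-vs-rest formulation. The label strings and sample verdicts are illustrative assumptions, not the paper's data.

```python
def faithfulness_score(verdicts, supported_label="supported"):
    """Proportion of rationale statements judged as supported."""
    if not verdicts:
        return 0.0
    return sum(v == supported_label for v in verdicts) / len(verdicts)


def per_category_f1(gold, predicted):
    """One-vs-rest F1 for every category seen in gold or predicted labels."""
    result = {}
    for cat in set(gold) | set(predicted):
        tp = sum(g == cat and p == cat for g, p in zip(gold, predicted))
        fp = sum(g != cat and p == cat for g, p in zip(gold, predicted))
        fn = sum(g == cat and p != cat for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        result[cat] = 2 * tp / denom if denom else 0.0
    return result


# Invented verdicts for four statements of one rationale.
gold = ["supported", "supported", "contradicted", "irrelevant"]
pred = ["supported", "contradicted", "contradicted", "irrelevant"]
score = faithfulness_score(pred)
f1 = per_category_f1(gold, pred)
```

Reporting F1 per category rather than a single average shows which verdict types the verifier confuses, which is useful for separating retrieval errors from reasoning errors.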

Datasets

BioASQ, PubMedQA, PubMed abstracts, MIRAGE benchmark

Benchmarks

BioASQ, PubMedQA, MIRAGE