Interpret chest X‑ray reports by combining concept bottlenecks with a multi‑agent retrieval system

Overview

Decision Snapshot: Needs Validation

Prototype-stage system: promising accuracy and report gains on one public dataset, validated by LLM judges but lacking clinical user studies and broad external validation.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Shows a practical path to make automated CXR outputs explainable: you can keep high accuracy while surfacing concept evidence and improve report quality with a multi-agent retrieval pipeline.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Engineering Lead

Summary TLDR

The paper builds an interpretable chest X‑ray pipeline: a Concept Bottleneck Model (CBM) maps image features to 20 human‑readable clinical concepts, then a multi‑agent Retrieval-Augmented Generation (RAG) system (ReAct agents + Radiologist + Medical Writer) retrieves clinical documents and composes reports. On the COVID-QU set (33,920 images) the CBM reached 81% classification accuracy. Multi‑agent report generation improved LLM-judge metrics (examples: Correctness 0.85 → 0.95, Clinical Usefulness 0.92 → 0.96 for Mistral 7B). The method supports interventions (fixing 3–4 concepts often corrects errors) and trades slightly worse clustering scores for more clinically realistic reports.

Problem Statement

Deep CXR classifiers are accurate but hard to trust because they are black boxes. Automated report generators can be factually inconsistent. The paper aims to make CXR classification and report generation explainable by exposing concept-level evidence and using a multi-agent retrieval and writing pipeline.

Main Contribution

Combine a CBM that discovers its concepts automatically with ChexAgent image embeddings, producing concept vectors that explain each disease prediction.

Introduce a multi-agent RAG system: disease-specific ReAct agents, a Radiologist agent to score concept influence, and a Medical Writer agent to compose reports.

Key Findings

CBM classification accuracy on COVID-QU

Numbers: 81% accuracy on COVID-QU (Table 1)

Practical Use: You can get competitive accuracy while exposing concept scores for each image, so clinicians can see why the model decided.

Evidence Ref: Table 1
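
To make the idea of concept scores concrete, here is a minimal sketch of a concept bottleneck prediction step. It is not the paper's model: the three concept names, the class labels, and the weights are made-up assumptions for illustration, whereas the paper reports 20 concepts derived from ChexAgent embeddings. The concept scores are taken as given; only the final, interpretable linear layer is shown.

```python
import math

# Hypothetical concept names, class labels, and weights (illustration only).
CONCEPTS = ["ground-glass opacity", "consolidation", "clear lung fields"]
CLASSES = ["COVID-19", "Non-COVID infection", "Normal"]

# One row of concept weights per class: a linear layer on top of the bottleneck.
WEIGHTS = [
    [2.0, 0.5, -1.5],
    [0.3, 2.0, -1.0],
    [-1.5, -1.5, 2.5],
]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_from_concepts(concept_scores):
    """Return (label, class probabilities, per-concept contributions to the predicted class)."""
    logits = [sum(w * c for w, c in zip(row, concept_scores)) for row in WEIGHTS]
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    contributions = {
        name: WEIGHTS[best][j] * concept_scores[j] for j, name in enumerate(CONCEPTS)
    }
    return CLASSES[best], probs, contributions

# Assumed concept scores for one image.
scores = [0.9, 0.2, 0.1]
label, probs, contributions = predict_from_concepts(scores)
print(label)
for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {name}: {value:+.2f}")
```

Because the prediction is a weighted sum of concept scores, the per-concept contributions printed at the end are the explanation a clinician would inspect.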

Multi-agent RAG improves report correctness and clinical usefulness

Numbers: Correctness 0.85 → 0.95; Clinical Usefulness 0.92 → 0.96 (Mistral 7B, Table 3)

Practical Use: Using multiple specialized agents for retrieval and writing yields more clinically accurate and useful reports than a single-agent RAG.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.81	Bio-VIL 0.78	+0.03 vs best baseline	COVID-QU test	Table 1 lists 0.81 for the proposed model vs 0.78 for Bio-VIL	Table 1
Report correctness (LLM judge)	0.95	Single-agent 0.85	+0.10	Generated reports evaluated by Mistral 7B	Paper notes Correctness increased from 0.85 to 0.95 for Mistral 7B	Table 3

What To Try In 7 Days

Run a CBM on a sample CXR set to surface concept scores for each prediction.

Prototype a simple ReAct retrieval agent querying a Qdrant index of clinical docs.

Add a lightweight 'concept intervention' step: allow a clinician to correct top 3 concept scores and re-evaluate predictions on misclassified cases.
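
The concept intervention step in the last item could be prototyped along these lines. This is a sketch under stated assumptions: the two-class linear head, the concept names, and the corrected values are hypothetical, and a real trial would run the trained CBM head instead of the toy `predict` function.

```python
# Hypothetical two-class linear head over three concepts (illustration only).
CONCEPTS = ["ground-glass opacity", "consolidation", "clear lung fields"]
CLASSES = ["Abnormal", "Normal"]
WEIGHTS = [[1.5, 1.5, -2.0], [-1.5, -1.5, 2.0]]

def predict(concept_scores):
    """Pick the class with the largest weighted sum of concept scores."""
    logits = [sum(w * c for w, c in zip(row, concept_scores)) for row in WEIGHTS]
    return CLASSES[logits.index(max(logits))]

def intervene(concept_scores, corrections):
    """Return a copy of the scores with clinician-supplied values overriding the model's."""
    updated = list(concept_scores)
    for name, value in corrections.items():
        updated[CONCEPTS.index(name)] = value
    return updated

# Assumed model output for a misclassified case.
model_scores = [0.2, 0.1, 0.7]
before = predict(model_scores)

# The clinician corrects two concepts; the third keeps the model's score.
corrected = intervene(
    model_scores,
    {"ground-glass opacity": 0.9, "clear lung fields": 0.1},
)
after = predict(corrected)
print(before, "->", after)
```

Running this over the misclassified cases and counting how often `after` matches the true label gives a quick estimate of how much a few corrected concepts help.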

Agent Features

Memory

retrieval memory via document embeddings
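
As a rough illustration of retrieval over document embeddings, the sketch below ranks a tiny corpus by cosine similarity. It is a stand-in, not the Qdrant or Mistral Embed API: the bag-of-words `embed` function, the in-memory `INDEX`, and the three documents are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the indexed clinical documents.
DOCS = {
    "doc-covid": "bilateral ground glass opacity is common in covid pneumonia",
    "doc-bacterial": "lobar consolidation suggests bacterial pneumonia",
    "doc-normal": "clear lung fields and normal heart size indicate a normal study",
}

def embed(text):
    """Toy embedding: bag-of-words counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Embed every document once; this dictionary plays the role of the vector index.
INDEX = {doc_id: embed(text) for doc_id, text in DOCS.items()}

def retrieve(query, k=2):
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda d: cosine(q, INDEX[d]), reverse=True)
    return ranked[:k]

top = retrieve("ground glass opacity in both lungs", k=1)
print(top)
```

Swapping `embed` for a learned embedding model and `INDEX` for a vector database changes the quality of the ranking but not the shape of the lookup.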

Planning

sequential agent pipeline (retrieve → analyze → write)

Tool Use

vector DB retrieval (Qdrant), LLMs for embedding and judging, VLM for image embeddings (ChexAgent)

Frameworks

CrewAI, LlamaIndex

Is Agentic

Yes

Architectures

ReAct agent per disease, Radiologist agent, Medical Writer agent

Collaboration

agent-to-agent handoff (ReAct → Radiologist → Writer)
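
The handoff can be pictured as one state object passed through three steps. The sketch below uses plain functions in place of LLM-backed agents and does not use the CrewAI or LlamaIndex APIs. The `CaseState` record, the function names, the knowledge-base snippet, and the ranking rule (sort concepts by score) are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class CaseState:
    """Hypothetical record handed from one agent to the next."""
    prediction: str
    concept_scores: dict
    evidence: list = field(default_factory=list)
    ranked_concepts: list = field(default_factory=list)
    report: str = ""

def react_agent(state, knowledge_base):
    # Stand-in for retrieval: look up snippets filed under the predicted disease.
    state.evidence = knowledge_base.get(state.prediction, [])
    return state

def radiologist_agent(state):
    # Stand-in for influence scoring: rank concepts by score, highest first.
    state.ranked_concepts = sorted(
        state.concept_scores, key=state.concept_scores.get, reverse=True
    )
    return state

def writer_agent(state):
    # Compose a short report from the top concepts and the retrieved evidence.
    findings = ", ".join(state.ranked_concepts[:2])
    support = " ".join(state.evidence) or "No supporting document retrieved."
    state.report = (
        f"Impression: {state.prediction}. Key concepts: {findings}. Evidence: {support}"
    )
    return state

def run_pipeline(state, knowledge_base):
    """Sequential handoff: retrieve, then analyze, then write."""
    state = react_agent(state, knowledge_base)
    state = radiologist_agent(state)
    return writer_agent(state)

# Placeholder knowledge base and concept scores (not taken from the paper).
kb = {"COVID-19": ["Placeholder snippet about ground-glass opacity."]}
case = CaseState(
    "COVID-19",
    {"ground-glass opacity": 0.9, "consolidation": 0.4, "pleural effusion": 0.1},
)
result = run_pipeline(case, kb)
print(result.report)
```

Keeping each step's output in a named field makes it easy to log what the writer saw when a report turns out to be wrong.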

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/tifat58/IRRwith-CBM-RAG

Data URLs

COVID-QU dataset (Chowdhury et al., 2020, IEEE Access) referenced in paper

Risks & Boundaries

Limitations

Evaluation limited to COVID-QU dataset; generalization to other hospitals is untested.

Report judgments rely on LLM judges, which can reflect model biases and do not replace clinician evaluation.

When Not To Use

Do not deploy for direct clinical decision-making without clinician validation.

Avoid using this exact pipeline on non-CXR imaging without revalidation.

Failure Modes

Wrong concept extraction leads to incorrect diagnosis and misleading reports.

Retrieval returns noisy or irrelevant documents that drive incorrect explanations.

Core Entities

Models

ChexAgent, Concept Bottleneck Model (CBM), CLIP, Bio-VIL, Label-free CBM, Robust CBM, GPT-4, Mistral Embed Model, Mistral 7B, Llama 3.1, Gemma2, LLaVA, GPT-3.5 Turbo, Dragonfly-Med, Medllama2

Metrics

Accuracy, semantic similarity, correctness, clinical usefulness, consistency, Silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn

Datasets

COVID-QU (33,920 CXR images)

