Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
Shows a practical path to explainable automated chest X-ray (CXR) outputs: the classifier surfaces concept-level evidence while reaching 81% accuracy on COVID-QU, and a multi-agent retrieval pipeline improves report quality.
Summary TLDR
The paper builds an interpretable chest X‑ray pipeline: a Concept Bottleneck Model (CBM) maps image features to 20 human‑readable clinical concepts, then a multi‑agent Retrieval-Augmented Generation (RAG) system (ReAct agents + Radiologist + Medical Writer) retrieves clinical documents and composes reports. On the COVID-QU set (33,920 images) the CBM reached 81% classification accuracy. Multi‑agent report generation improved LLM-judge metrics (examples: Correctness 0.85 → 0.95, Clinical Usefulness 0.92 → 0.96 for Mistral 7B). The method supports interventions (fixing 3–4 concepts often corrects errors) and trades slightly worse clustering scores for more clinically realistic reports.
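The concept-bottleneck idea in the summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes concept scores come from a linear projection of the image embedding followed by a sigmoid, and that the label head is a linear layer over those scores. The function names, weights, and toy sizes are all hypothetical.

```python
import math

def linear(weights, bias, x):
    """Apply y = W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cbm_predict(embedding, concept_w, concept_b, class_w, class_b):
    """Return (concept_scores, class_probs).

    The class prediction depends on the image only through the
    concept scores, which is what makes the bottleneck inspectable.
    """
    concept_logits = linear(concept_w, concept_b, embedding)
    concept_scores = [1 / (1 + math.exp(-z)) for z in concept_logits]
    class_probs = softmax(linear(class_w, class_b, concept_scores))
    return concept_scores, class_probs

# Toy sizes (assumption): 2-d embedding, 2 concepts, 2 classes.
concepts, probs = cbm_predict(
    embedding=[1.0, 0.0],
    concept_w=[[2.0, 0.0], [0.0, 2.0]],
    concept_b=[0.0, 0.0],
    class_w=[[1.0, 0.0], [0.0, 1.0]],
    class_b=[0.0, 0.0],
)
```

In the paper's setting the embedding would come from ChexAgent and the bottleneck would hold 20 concepts; the toy sizes here only keep the example readable.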
Problem Statement
Deep CXR classifiers are accurate but hard to trust because they are black boxes. Automated report generators can be factually inconsistent. The paper aims to make CXR classification and report generation explainable by exposing concept-level evidence and using a multi-agent retrieval and writing pipeline.
Main Contribution
Combine a CBM based on automatic concept discovery with ChexAgent image embeddings to output concept vectors that explain disease predictions.
Introduce a multi-agent RAG system: disease-specific ReAct agents, a Radiologist agent to score concept influence, and a Medical Writer agent to compose reports.
Evaluate on COVID-QU (33,920 images): 81% classification accuracy and LLM-judge gains in report correctness and clinical usefulness versus single-agent baselines.
Show concept interventions: correcting 3–4 top contributing concepts often fixes misclassifications, demonstrating actionable interpretability.
Open-source intent: authors state code will be released (GitHub link provided in paper).
Key Findings
CBM classification accuracy on COVID-QU
Multi-agent RAG improves report correctness and clinical usefulness
Concept intervention corrects many misclassifications
Results
Accuracy
Report correctness (LLM judge)
Clinical usefulness (LLM judge)
Clustering quality (Silhouette)
Who Should Care
What To Try In 7 Days
Run a CBM on a sample CXR set to surface concept scores for each prediction.
Prototype a simple ReAct retrieval agent querying a Qdrant index of clinical docs.
Add a lightweight 'concept intervention' step: allow a clinician to correct the top 3 concept scores and re-evaluate predictions on misclassified cases.
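The intervention step in the last item could look roughly like the sketch below. It assumes a linear label head over concept scores and ranks concepts by their contribution (weight times score) to the predicted class; the paper may rank concept influence differently. All names and numbers are hypothetical.

```python
def class_logits(concepts, class_w):
    """Linear label head: one logit per class from the concept scores."""
    return [sum(w * c for w, c in zip(row, concepts)) for row in class_w]

def predict(concepts, class_w):
    """Index of the class with the highest logit."""
    logits = class_logits(concepts, class_w)
    return max(range(len(logits)), key=logits.__getitem__)

def top_contributors(concepts, class_w, k):
    """Indices of the k concepts contributing most to the predicted class."""
    pred = predict(concepts, class_w)
    contrib = [class_w[pred][i] * concepts[i] for i in range(len(concepts))]
    return sorted(range(len(concepts)), key=lambda i: contrib[i], reverse=True)[:k]

def intervene(concepts, corrections):
    """Return a copy of the concept scores with clinician corrections applied."""
    return [corrections.get(i, c) for i, c in enumerate(concepts)]

# Toy case (assumption): 3 concepts, 2 classes.
class_w = [[2.0, 0.0, 0.5], [0.0, 2.0, 0.5]]
concepts = [0.9, 0.2, 0.1]
before = predict(concepts, class_w)
to_review = top_contributors(concepts, class_w, k=1)
# The clinician judges the flagged concept to be absent and sets it to 0.
after = predict(intervene(concepts, {to_review[0]: 0.0}), class_w)
```

Because the label head sees only the concept scores, re-running it after a correction is cheap, which is what makes this practical to try on a handful of misclassified cases.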
Agent Features
Memory
- retrieval memory via document embeddings
Planning
- sequential agent pipeline (retrieve → analyze → write)
Tool Use
- vector DB retrieval (Qdrant)
- LLMs for embedding and judging
- VLM for image embeddings (ChexAgent)
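To make the retrieval tool concrete, here is a minimal in-memory stand-in for a vector search. It does not use the Qdrant client API; it only illustrates the underlying operation (cosine-similarity top-k over document embeddings). The document ids and vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query.

    `index` is a list of (doc_id, vector) pairs, standing in for a
    vector database collection.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical 2-d embeddings of three clinical documents.
index = [
    ("pneumonia_notes", [1.0, 0.0]),
    ("covid_guideline", [0.9, 0.1]),
    ("normal_cxr", [0.0, 1.0]),
]
hits = top_k([1.0, 0.0], index, k=2)
```

A real deployment would replace `index` and `top_k` with calls to the vector database and would embed queries with the same model used for the documents.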
Frameworks
- CrewAI
- LlamaIndex
Is Agentic
true
Architectures
- ReAct agent per disease
- Radiologist agent
- Medical Writer agent
Collaboration
- agent-to-agent handoff (ReAct → Radiologist → Writer)
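The handoff order can be sketched with plain functions passing a shared state object. This is not CrewAI or LlamaIndex code, and no LLM is called: each function is a hypothetical placeholder that shows only what each stage reads and writes.

```python
from dataclasses import dataclass, field

@dataclass
class CaseState:
    """Shared state handed from one agent to the next."""
    disease: str
    concept_scores: dict
    evidence: list = field(default_factory=list)
    ranked_concepts: list = field(default_factory=list)
    report: str = ""

def retrieval_agent(state, corpus):
    """Stand-in for the disease-specific ReAct agent: keyword match on the corpus."""
    state.evidence = [doc for doc in corpus if state.disease.lower() in doc.lower()]
    return state

def radiologist_agent(state, top_n=2):
    """Stand-in for the Radiologist agent: rank concepts by score."""
    ranked = sorted(state.concept_scores.items(), key=lambda kv: kv[1], reverse=True)
    state.ranked_concepts = ranked[:top_n]
    return state

def writer_agent(state):
    """Stand-in for the Medical Writer agent: fill a fixed report template."""
    findings = ", ".join(f"{name} ({score:.2f})" for name, score in state.ranked_concepts)
    state.report = (
        f"Impression: {state.disease}. Key concepts: {findings}. "
        f"Sources consulted: {len(state.evidence)}."
    )
    return state

def run_pipeline(state, corpus):
    """Sequential handoff: retrieve, then analyze, then write."""
    return writer_agent(radiologist_agent(retrieval_agent(state, corpus)))
```

Keeping the stages as separate functions over one state object makes it easy to log what each agent contributed, which matters when a report needs to be traced back to its evidence.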
Reproducibility
Data Urls
- COVID-QU dataset (Chowdhury et al., 2020, IEEE Access) referenced in paper
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to COVID-QU dataset; generalization to other hospitals is untested.
- Report judgments rely on LLMs, which can reflect model biases and do not replace clinician evaluation.
- Concept discovery depends on GPT-4 prompts and choices of descriptors, which may vary.
- No human clinician study reported for real-world workflow integration.
When Not To Use
- Do not deploy for direct clinical decision-making without clinician validation.
- Avoid using this exact pipeline on non-CXR imaging without revalidation.
- Not suited for real-time triage under low-latency constraints, because multi-agent retrieval adds latency.
Failure Modes
- Wrong concept extraction leads to incorrect diagnosis and misleading reports.
- Retrieval returns noisy or irrelevant documents that drive incorrect explanations.
- LLM-based judge overestimates report quality or misses subtle clinical errors.
Core Entities
Models
- ChexAgent
- Concept Bottleneck Model (CBM)
- CLIP
- Bio-VIL
- Label-free CBM
- Robust CBM
- GPT-4
- Mistral Embed Model
- Mistral 7B
- Llama 3.1
- Gemma2
- LLaVA
- GPT-3.5 Turbo
- Dragonfly-Med
- Medllama2
Metrics
- Accuracy
- semantic similarity
- correctness
- clinical usefulness
- consistency
- Silhouette
- Davies-Bouldin
- Calinski-Harabasz
- Dunn
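Of the clustering metrics listed, the Silhouette score is the one reported under Results. As a reference for how it is defined, here is a small pure-Python sketch using Euclidean distance; the summary does not say which distance or implementation the paper used, so treat those choices as assumptions.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to the other points in its cluster,
    b = lowest mean distance to the points of any other cluster,
    s = (b - a) / max(a, b). Singleton clusters score 0.
    """
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Two well-separated toy clusters, scored with a good and a bad labeling.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Scores lie in [-1, 1]; values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own.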
Datasets
- COVID-QU (33,920 CXR images)

