Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
MEDCO shows that multi-role simulation plus a small retrieval memory can bring weaker models close to the performance of stronger ones in simulated medical training. This could differentiate education products and reduce the demand on instructor time.
Summary TLDR
MEDCO is a multi-agent training system that uses three LLM roles (patient, radiologist, medical expert) plus a student agent and a small retrieval memory to simulate real clinical learning. In simulated experiments on MVME, a 506-case Chinese medical dataset, MEDCO boosts diagnostic scores: a weak student model (GPT-3.5) improves its average expert-rated diagnosis score (HDE) from ~1.97 to ~2.17 after learning and to ~2.30 with peer discussion. Retrieval memory and multi-modal image interpretation further raise ICD-10 F1 and fine-grained accuracy. The results are promising but come from LLM role-play, not human students.
Problem Statement
Single-chatbot tools cannot mimic the multidisciplinary, interactive nature of real medical training. Students need practice asking the right questions, feedback from several specialties, and memory of past cases. Single-LLM tutors provide none of these. MEDCO is a multi-agent copilot that simulates patients, specialists, and expert feedback to train students, and the paper tests whether this improves diagnostic learning.
Main Contribution
Design of MEDCO: a prompt-driven multi-agent copilot with four roles (patient, student, radiologist, medical expert) and tools for image/report interpretation.
A lightweight memory design (case-store, symptom-store, disease-store) and ICD-10–based hierarchical evaluation (coarse/medium/fine).
Simulated experiments showing that agentic learning, retrieval-based recall, and peer discussion improve diagnostic scores and ICD-10 matching on MVME cases.
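The coarse/medium/fine evaluation can be pictured with a minimal sketch. The summary does not say how the three levels map onto ICD-10, so the mapping here is an assumption: coarse is the leading letter (a rough stand-in for the chapter), medium is the three-character category, and fine is the full code. The function names are hypothetical.

```python
def icd10_levels(code: str) -> dict[str, str]:
    """Split an ICD-10 code into three granularities (assumed mapping)."""
    normalized = code.strip().upper()
    category = normalized.split(".")[0]  # e.g. "I63" from "I63.9"
    return {
        "coarse": category[:1],  # leading letter, e.g. "I"
        "medium": category,      # three-character category
        "fine": normalized,      # full code including any subcategory
    }


def match_level(predicted: str, reference: str) -> str:
    """Return the finest level at which two codes agree, or "none"."""
    pred, ref = icd10_levels(predicted), icd10_levels(reference)
    for level in ("fine", "medium", "coarse"):
        if pred[level] == ref[level]:
            return level
    return "none"
```

Under these assumptions, match_level("I63.9", "I63.4") agrees only at the medium level, so a prediction can earn partial credit when it names the right disease family but the wrong subtype.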
Key Findings
MEDCO raises the average expert-rated diagnostic score (HDE) of a weak student (GPT-3.5) from ~1.97 to ~2.17 after learning.
Peer discussion notably boosts entity recall and F1 in ICD-10 matching for the weak student.
Strong models also benefit: Claude3.5-Sonnet improves after MEDCO training.
Multi-modal input (images/reports) increases fine-grained diagnostic accuracy and ICD F1.
Results
HDE average (expert rating)
SEMA recall / F1 (ICD-10 matching)
Accuracy
Multi-modal ICD-10 F1 (Neurology subset)
Who Should Care
What To Try In 7 Days
Prototype a 3-role prompt flow (student/patient/expert) with one LLM and scripted cases to test HDE-like ratings.
Add a simple vector DB (ChromaDB) with 50 case summaries and test recall-based follow-up question prompts.
Run an A/B test: solo practice vs. peer-discussion-style prompts on 50 clinical cases, measuring the change in entity recall and F1.
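A first prototype of the three-role flow might look like the sketch below. It is not MEDCO's implementation: the prompt wording, the "Diagnosis:" stop convention, and the names run_episode and scripted_llm are all assumptions for illustration. The LLM is passed in as a plain function so a scripted stand-in can be swapped for a real model call.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

# Illustrative prompt templates, one per role (not the paper's prompts).
ROLE_PROMPTS = {
    "patient": "You are a patient. Answer only what is asked, using this case: {case}",
    "student": ("You are a medical student. Conversation so far:\n{history}\n"
                "Ask one question, or reply 'Diagnosis: ...' when ready."),
    "expert": ("You are a medical expert. Reference diagnosis: {reference}\n"
               "Student answer: {diagnosis}\nGive short feedback."),
}


def run_episode(llm: LLM, case: str, reference: str, max_turns: int = 3) -> dict:
    """Alternate student and patient turns, then ask the expert for feedback."""
    history: list[str] = []
    student_msg = ""
    for _ in range(max_turns):
        student_msg = llm(ROLE_PROMPTS["student"].format(history="\n".join(history)))
        history.append(f"Student: {student_msg}")
        if student_msg.lower().startswith("diagnosis:"):
            break  # the student committed to an answer
        patient_prompt = ROLE_PROMPTS["patient"].format(case=case)
        patient_msg = llm(f"{patient_prompt}\nQuestion: {student_msg}")
        history.append(f"Patient: {patient_msg}")
    feedback = llm(ROLE_PROMPTS["expert"].format(reference=reference, diagnosis=student_msg))
    return {"history": history, "diagnosis": student_msg, "feedback": feedback}


def scripted_llm(prompt: str) -> str:
    """Stand-in for a real model: canned replies keyed on the role prompt."""
    if prompt.startswith("You are a patient"):
        return "I have had a headache for three days."
    if prompt.startswith("You are a medical expert"):
        return "Reasonable, but ask about aura next time."
    # Student role: ask one question, then diagnose once the patient has spoken.
    return "Diagnosis: migraine" if "Patient:" in prompt else "What brings you in today?"


result = run_episode(scripted_llm, case="3-day headache", reference="migraine")
```

Replacing scripted_llm with a wrapper around a real chat API leaves run_episode unchanged, and the returned history can be handed to a human rater for HDE-like scoring.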
Agent Features
Memory
- key-value memory (case-store, symptom-store, disease-store)
- retrieval with ChromaDB + embeddings
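The three-store memory can be approximated with the standard library alone. The sketch below is a stand-in, not the paper's design: it replaces ChromaDB and learned embeddings with bag-of-words cosine similarity, and the class name StudentMemory, its methods, and the sample entries are hypothetical.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Bag-of-words term counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class StudentMemory:
    """Key-value memory split into case, symptom, and disease stores."""

    def __init__(self) -> None:
        self.stores: dict[str, dict[str, str]] = {"case": {}, "symptom": {}, "disease": {}}

    def add(self, store: str, key: str, value: str) -> None:
        self.stores[store][key] = value

    def recall(self, store: str, query: str, top_k: int = 2) -> list[tuple[str, str]]:
        """Return up to top_k (key, value) pairs ranked by similarity to the query."""
        query_vec = _vectorize(query)
        scored = [
            (_cosine(query_vec, _vectorize(f"{key} {value}")), key, value)
            for key, value in self.stores[store].items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(key, value) for score, key, value in scored[:top_k] if score > 0]


memory = StudentMemory()
memory.add("symptom", "headache with aura", "Ask about visual disturbance and family history.")
memory.add("symptom", "chest pain", "Ask about radiation, exertion, and shortness of breath.")
hits = memory.recall("symptom", "patient reports chest pain on exertion", top_k=1)
```

Keeping the stores separate lets the student agent query by intent: the symptom store for follow-up questions, the disease store for differentials, and the case store for similar past encounters. A vector database would replace only the similarity step.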
Tool Use
- image interpreting tool
- report VQA tool
Frameworks
- prompt templates for each role
Is Agentic
true
Architectures
- prompt-driven multi-agent roles
Collaboration
- peer discussion
- multi-department (radiologist + expert) coordination
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Students in experiments are LLMs, not real medical students, so human learning impact is unknown
- Limited multi-modal validation (only 16 Neurology cases with collected images)
- Expert feedback is currently text-only; no multi-modal expert exemplars were used
- Peer discussion increases recall but can raise false positives in ICD extraction
When Not To Use
- High-stakes clinical decision-making without human clinician oversight
- Regulated settings requiring certified medical training records
- As a sole assessment for student certification
Failure Modes
- LLM hallucinations in diagnosis or image interpretation
- Memory retrieval producing irrelevant/over-extracted entities (false positives)
- Bias from using the same or similar LLMs for both student and expert roles
- Overfitting to MVME dataset idiosyncrasies
Core Entities
Models
- GPT-3.5
- GPT-4o-mini
- Claude3.5-Sonnet
Metrics
- HDE (average expert rating on a 1-4 scale)
- SEMA: precision/recall/F1 on ICD-10 entity matching
- Accuracy
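Precision, recall, and F1 over extracted ICD-10 entities can be computed as in the sketch below. The summary does not specify how SEMA decides that two entities match, so exact set membership is assumed here, and the example codes are invented for illustration rather than taken from the paper.

```python
def entity_prf(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for predicted entities against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Invented example: a discussion round adds one correct and one spurious code.
reference = {"I63", "I10"}
solo = entity_prf({"I63"}, reference)
discussed = entity_prf({"I63", "I10", "G43"}, reference)
```

In this made-up case recall rises from 0.5 to 1.0 while precision falls from 1.0 to about 0.67, the same trade-off the limitations section describes for peer discussion.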
Datasets
- MVME (Chinese medical records, 506 cases)
Benchmarks
- HDE
- SEMA
- CASCADE

