Overview
MEDCO is a working prototype validated on simulated LLM students and a 506-case dataset. The results are promising, but the work reports no human-student trials and provides no public code.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
MEDCO shows that multi-role simulation plus a small retrieval memory can raise a weaker model's diagnostic scores in simulated medical training. This could differentiate education products and reduce the demand on instructor time.
Who Should Care
Summary TLDR
MEDCO is a multi-agent training system that uses three LLM roles (patient, radiologist, medical expert) plus a student agent and a small retrieval memory to simulate clinical learning. In simulated experiments on MVME, a 506-case Chinese medical dataset, MEDCO improves diagnostic scores: for a weak student model (GPT-3.5), the average expert-rated diagnosis score (HDE) rises from ~1.97 to ~2.17 after learning and to ~2.30 with peer discussion. Retrieval memory and multi-modal image interpretation further raise ICD-10 F1 and fine-grained accuracy. The results are promising but come from LLM role-play, not human students.
Problem Statement
Single-chatbot tools cannot mimic the multidisciplinary, interactive nature of real medical training. Students need practice asking the right questions, feedback from multiple disciplines, and memory of past cases; solitary LLM tutors lack these capabilities. MEDCO builds a multi-agent copilot that simulates patients, specialists, and expert feedback to train students, and tests whether this improves diagnostic learning.
Main Contribution
Design of MEDCO: a prompt-driven multi-agent copilot with four roles (patient, student, radiologist, medical expert) and tools for image/report interpretation.
A lightweight memory design (case-store, symptom-store, disease-store) and ICD-10–based hierarchical evaluation (coarse/medium/fine).
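The coarse/medium/fine ICD-10 evaluation can be illustrated with a minimal sketch. The source does not say how each level maps onto a code, so the mapping here is an assumption: "fine" means the full code matches, "medium" means the three-character category matches, and "coarse" means only the leading letter matches (a rough stand-in for the ICD-10 chapter). The function names are hypothetical.

```python
def normalize(code: str) -> str:
    """Uppercase the code and drop the dot, e.g. 'g40.1' -> 'G401'."""
    return code.strip().upper().replace(".", "")


def match_level(predicted: str, reference: str) -> str:
    """Return the finest level at which two ICD-10 codes agree.

    Assumed mapping (not taken from the source):
      fine   -> full code identical
      medium -> same three-character category (e.g. G40)
      coarse -> same leading letter (rough proxy for the chapter)
    """
    p, r = normalize(predicted), normalize(reference)
    if not p or not r:
        return "none"
    if p == r:
        return "fine"
    if len(p) >= 3 and len(r) >= 3 and p[:3] == r[:3]:
        return "medium"
    if p[0] == r[0]:
        return "coarse"
    return "none"


if __name__ == "__main__":
    for pred, ref in [("G40.1", "G40.1"), ("G40.1", "G40.9"),
                      ("G40.1", "G43.0"), ("G40.1", "I63.9")]:
        print(pred, ref, match_level(pred, ref))
```

Reporting accuracy separately per level shows whether a student is in the right disease area even when the exact diagnosis is wrong.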
Key Findings
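MEDCO raises the average expert-rated diagnostic score (HDE) of a weak student model (GPT-3.5) from 1.965 to 2.169 with learned knowledge and to 2.299 with peer discussion.
Peer discussion raises the weak student's entity recall in ICD-10 matching from 17.95% to 29.72% and its F1 from 26.01% to 36.04%.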
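For readers less familiar with recall and F1, the sketch below computes them over sets of predicted and reference entities. It uses exact string matching as a stand-in: the source does not describe how its matching is done, so this is an assumption and not the paper's metric. The function name and example codes are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str],
                        reference: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 using exact string matches."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: 3 predicted codes, 4 reference codes, 2 in common.
    p, r, f = precision_recall_f1({"G40", "I63", "J18"},
                                  {"G40", "I63", "E11", "K21"})
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Recall rises when the student names more of the reference entities. F1 also penalizes over-extraction, which matters given the false-positive failure mode listed under Risks & Boundaries.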
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| HDE average (expert rating) | GPT-3.5: 1.965 → 2.169 (w/ knowledge) → 2.299 (peer discussion) | 1.965 (GPT-3.5 untrained) | +0.204 (w/ knowledge); +0.334 (peer discussion), both vs. baseline | Entire test set (247 cases) | Table 1: HDE results | Table 1 |
| SEMA recall / F1 (ICD-10 matching) | GPT-3.5 recall 17.95% → 29.72%, F1 26.01% → 36.04% (peer discussion) | 17.95% recall, 26.01% F1 | +11.77 pp recall; +10.03 pp F1 | Entire test set | Table 2 SEMA rows | Table 2 |
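The deltas in the table can be re-derived from the reported values with a few lines of arithmetic. The numbers below are copied from the table; the dictionary layout and function name are illustrative only.

```python
# Values as reported in the results table (GPT-3.5 student).
HDE = {"baseline": 1.965, "with_knowledge": 2.169, "peer_discussion": 2.299}
SEMA = {
    "recall": {"baseline": 17.95, "peer_discussion": 29.72},  # percent
    "f1": {"baseline": 26.01, "peer_discussion": 36.04},      # percent
}


def delta(before: float, after: float, digits: int) -> float:
    """Absolute change, rounded to the precision used in the table."""
    return round(after - before, digits)


if __name__ == "__main__":
    print("HDE, with knowledge:", delta(HDE["baseline"], HDE["with_knowledge"], 3))
    print("HDE, peer discussion:", delta(HDE["baseline"], HDE["peer_discussion"], 3))
    for name, row in SEMA.items():
        change = delta(row["baseline"], row["peer_discussion"], 2)
        print(f"SEMA {name}: {change} percentage points")
```

The SEMA changes are percentage-point differences, not relative improvements.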
What To Try In 7 Days
Prototype a 3-role prompt flow (student/patient/expert) with one LLM and scripted cases to test HDE-like ratings.
Add a simple vector database (e.g., Chroma) with 50 case summaries and test follow-up question prompts that draw on the retrieved cases.
Run an A/B test of solo practice vs. peer-discussion-style prompts on 50 clinical cases and measure the change in entity recall and F1.
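A three-role prompt flow like the one suggested above can be prototyped without committing to a model provider. The sketch below treats the LLM as a plain callable so a stub can stand in during testing. The prompts, function names, and turn structure are all assumptions made for illustration and are not taken from MEDCO.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (system_prompt, user_message) -> reply text.
LLM = Callable[[str, str], str]

# Illustrative role prompts, not the prompts used by MEDCO.
ROLE_PROMPTS = {
    "patient": "You are a patient. Answer only what is asked, based on this case: {case}",
    "student": "You are a medical student. Ask one question at a time, then diagnose.",
    "expert": "You are a medical expert. Compare the diagnosis with the reference and give feedback.",
}


def run_session(llm: LLM, case: str, reference: str, max_turns: int = 3) -> Dict:
    """Run student/patient turns, then a final diagnosis and expert feedback."""
    transcript: List[Dict[str, str]] = []
    last_reply = "The patient has arrived. Begin the consultation."
    for _ in range(max_turns):
        question = llm(ROLE_PROMPTS["student"], last_reply)
        transcript.append({"role": "student", "text": question})
        last_reply = llm(ROLE_PROMPTS["patient"].format(case=case), question)
        transcript.append({"role": "patient", "text": last_reply})
    dialogue = "\n".join(f"{t['role']}: {t['text']}" for t in transcript)
    diagnosis = llm(ROLE_PROMPTS["student"], "Give your final diagnosis.\n" + dialogue)
    feedback = llm(ROLE_PROMPTS["expert"],
                   f"Reference: {reference}\nStudent diagnosis: {diagnosis}")
    return {"transcript": transcript, "diagnosis": diagnosis, "feedback": feedback}


def echo_llm(system_prompt: str, user_message: str) -> str:
    """Stub model for dry runs: tags the reply with the first prompt sentence."""
    role = system_prompt.split(".")[0]
    return f"[{role}] reply to: {user_message[:40]}"


if __name__ == "__main__":
    result = run_session(echo_llm, "45-year-old with recurrent headache",
                         "migraine", max_turns=2)
    print(len(result["transcript"]), "turns")
    print(result["feedback"])
```

Swapping `echo_llm` for a real client keeps the session logic unchanged, which makes it easier to compare solo and peer-discussion variants on the same scripted cases.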
Agent Features
Memory
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Reproducibility
Risks & Boundaries
Limitations
Students in experiments are LLMs, not real medical students, so human learning impact is unknown
Limited multi-modal validation (only 16 Neurology cases with collected images)
When Not To Use
High-stakes clinical decision-making without human clinician oversight
Regulated settings requiring certified medical training records
Failure Modes
LLM hallucinations in diagnosis or image interpretation
Memory retrieval producing irrelevant/over-extracted entities (false positives)

