MEDCO: a multi-agent copilot that trains medical students via patient, radiologist, and expert role-play

August 22, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

MEDCO is a working prototype validated on simulated LLM students and a 506-case dataset; promising signals exist but no human-student trials or public code are provided.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Hao Wei, Jianing Qiu, Haibao Yu, Wu Yuan

Why It Matters For Business

MEDCO shows multi-role simulation plus a small retrieval memory can lift weaker models to near-strong performance in simulated medical training, which can differentiate education products and reduce instructor bandwidth.

Summary TLDR

MEDCO is a multi-agent training system that uses three LLM roles (patient, radiologist, medical expert) plus a student agent and a small retrieval memory to simulate real clinical learning. In simulated experiments on MVME, a 506-case Chinese medical dataset, MEDCO boosts diagnostic scores: a weak student model (GPT-3.5) improves its average expert-rated diagnosis score (HDE) from ~1.97 to ~2.17 after learning and to ~2.30 with peer discussion. Retrieval memory and multi-modal image interpretation further raise ICD-10 F1 and fine-grained accuracy. Results are promising but come from LLM role-play, not human students.

Problem Statement

Single-chatbot tools cannot mimic the multidisciplinary, interactive nature of real medical training. Students need practice asking the right questions, multidisciplinary feedback, and memory of past cases, all of which solitary LLM tutors lack. MEDCO builds a multi-agent copilot that simulates patients, specialists, and expert feedback to train students, and tests whether this improves diagnostic learning.

Main Contribution

Design of MEDCO: a prompt-driven multi-agent copilot with four roles (patient, student, radiologist, medical expert) and tools for image/report interpretation.

A lightweight memory design (case-store, symptom-store, disease-store) and ICD-10–based hierarchical evaluation (coarse/medium/fine).
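The coarse/medium/fine evaluation can be pictured as matching ICD-10 codes at increasing prefix lengths. The sketch below is only an illustration: this summary does not say how the three levels are defined, so the mapping used here (coarse = leading letter, medium = three-character category, fine = full code) and the function names `matches` and `hierarchical_accuracy` are assumptions, not the authors' method.

```python
import re

# ASSUMPTION: prefix length per level; the paper's actual definition may differ.
LEVEL_PREFIX = {"coarse": 1, "medium": 3, "fine": None}


def normalize(code: str) -> str:
    """Upper-case an ICD-10 code and drop dots/whitespace, e.g. 'i63.9' -> 'I639'."""
    return re.sub(r"[^A-Z0-9]", "", code.upper())


def matches(pred: str, gold: str, level: str) -> bool:
    """Return True if the predicted code agrees with the gold code at this level."""
    p, g = normalize(pred), normalize(gold)
    n = LEVEL_PREFIX[level]
    if n is None:  # fine level: the full codes must be identical
        return p == g
    return len(p) >= n and p[:n] == g[:n]


def hierarchical_accuracy(pairs, level: str) -> float:
    """Fraction of (predicted, gold) pairs that match at the given level."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(matches(p, g, level) for p, g in pairs) / len(pairs)


# Toy (predicted, gold) pairs, invented for illustration only.
pairs = [("I63.9", "I63.4"), ("G40.1", "G43.0"), ("J18.9", "J18.9"), ("K35.8", "I21.0")]
for level in ("coarse", "medium", "fine"):
    print(level, hierarchical_accuracy(pairs, level))
```

On the toy pairs, accuracy falls as the level gets stricter, which is the behavior a hierarchical metric is meant to expose: a student can be in the right disease family while still missing the exact diagnosis.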

Key Findings

MEDCO raises average expert-rated diagnostic score (HDE) of a weak student (GPT-3.5).

Numbers: HDE avg: 1.965 → 2.169 (knowledge) → 2.299 (peer discussion)

Practical Use: Use multi-role feedback and memory recall to meaningfully improve diagnosis skills of weaker models and likely accelerate novice learning in training simulators.

Evidence Ref: Table 1

Peer discussion notably boosts entity recall and F1 in ICD-10 matching for the weak student.

Numbers: SEMA recall 17.95% → 29.72%; F1 26.01% → 36.04%

Practical Use: Adding peer-discussion workflows to training can increase the set-level disease recall and overall diagnostic agreement, but expect more extracted entities (possible false positives).

Evidence Ref: Table 2
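The recall/F1 trade-off described in the second finding can be illustrated with set-level precision, recall, and F1. This is a minimal sketch, assuming entities have already been extracted as ICD-10 codes and are compared by exact match; how SEMA actually extracts and matches entities is not described in this summary, and the function name `set_prf` and the example codes are hypothetical.

```python
def set_prf(predicted, gold):
    """Set-level precision, recall, and F1 between predicted and gold entities."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: three gold diagnoses for one case.
gold = ["I63", "I10", "E11"]
solo = ["I63"]                       # cautious answer: few entities
discussed = ["I63", "I10", "J18", "K21"]  # more entities after discussion

print("solo     ", set_prf(solo, gold))
print("discussed", set_prf(discussed, gold))
```

In this toy case the longer answer finds more of the gold diagnoses (higher recall and F1) while also adding wrong ones (lower precision), which mirrors the caution above about false positives.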

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| HDE average (expert rating) | GPT-3.5: 1.965 → 2.169 (w/ knowledge) → 2.299 (peer discussion) | 1.965 (GPT-3.5 untrained) | +0.204 (w/ knowledge); +0.334 (peer discussion vs baseline) | Entire test set (247 cases) | Table 1: HDE results | Table 1 |
| SEMA recall / F1 (ICD-10 matching) | GPT-3.5 recall 17.95% → 29.72%, F1 26.01% → 36.04% (peer discussion) | 17.95% recall, 26.01% F1 | +11.77 pp recall; +10.03 pp F1 | Entire test set | Table 2 SEMA rows | Table 2 |
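The deltas reported in the results above can be rechecked from the baseline and final values. The short script below uses only the numbers quoted in this summary; the row labels and the `delta` helper are illustrative.

```python
# (label, baseline, value, decimals) taken from the reported results.
rows = [
    ("HDE avg, w/ knowledge", 1.965, 2.169, 3),
    ("HDE avg, peer discussion", 1.965, 2.299, 3),
    ("SEMA recall (%), peer discussion", 17.95, 29.72, 2),
    ("SEMA F1 (%), peer discussion", 26.01, 36.04, 2),
]


def delta(baseline: float, value: float, decimals: int) -> float:
    """Absolute improvement over the baseline, rounded to the reported precision."""
    return round(value - baseline, decimals)


deltas = {label: delta(b, v, d) for label, b, v, d in rows}
for label, change in deltas.items():
    print(f"{label}: +{change}")
```

The percentage rows are differences in percentage points (pp), not relative gains.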

What To Try In 7 Days

Prototype a 3-role prompt flow (student/patient/expert) with one LLM and scripted cases to test HDE-like ratings.

Add a simple vector DB (Chromadb) with 50 case summaries and test recall-based follow-up question prompts.

Run an A/B test: solo practice vs peer-discussion-style prompts on 50 clinical cases, and measure the change in entity recall and F1.
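A first prototype of the role flow can be quite small. The sketch below wires student, patient, and expert prompts around a single LLM callable. Everything in it is an assumption made for illustration: the prompt wording, the `run_session` function, and the `echo_llm` stub (which stands in for a real model call so the flow runs offline). The 1-4 rating scale is the HDE scale mentioned in this summary.

```python
from typing import Callable, List, Tuple

# An LLM is modeled as a function: (system_prompt, user_message) -> reply.
LLM = Callable[[str, str], str]

# ASSUMPTION: illustrative prompt templates, not the paper's prompts.
ROLE_PROMPTS = {
    "patient": "You are a patient. Answer only what is asked, using this record: {case}",
    "student": "You are a medical student. Ask one question at a time, then give a diagnosis.",
    "expert": "You are a medical expert. Rate the diagnosis from 1 to 4 against this record: {case}",
}


def render(transcript: List[Tuple[str, str]]) -> str:
    """Flatten the transcript into plain text for the next prompt."""
    return "\n".join(f"{who}: {text}" for who, text in transcript)


def run_session(llm: LLM, case: str, max_questions: int = 3) -> List[Tuple[str, str]]:
    """Student interviews the patient, gives a diagnosis, then the expert rates it."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(max_questions):
        question = llm(ROLE_PROMPTS["student"], render(transcript) or "Begin the interview.")
        transcript.append(("student", question))
        answer = llm(ROLE_PROMPTS["patient"].format(case=case), question)
        transcript.append(("patient", answer))
    diagnosis = llm(ROLE_PROMPTS["student"], render(transcript) + "\nState your final diagnosis.")
    transcript.append(("student", diagnosis))
    feedback = llm(ROLE_PROMPTS["expert"].format(case=case), diagnosis)
    transcript.append(("expert", feedback))
    return transcript


def echo_llm(system_prompt: str, user_message: str) -> str:
    """Offline stub: replace with a real model call when prototyping."""
    role = system_prompt.split(".")[0]
    return f"[{role}] reply to: {user_message[:40]}"


for who, text in run_session(echo_llm, "55-year-old with sudden left-sided weakness", 2):
    print(f"{who}: {text}")
```

Keeping the model behind a plain callable makes it easy to swap the stub for a real API client, or to run the solo and peer-discussion variants through the same loop for comparison.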

Agent Features

Memory
key-value memory (case-store, symptom-store, disease-store)
retrieval with Chromadb + embeddings
Tool Use
image interpreting tool
report VQA tool
Frameworks
prompt templates for each role
Is Agentic

Yes

Architectures
prompt-driven multi-agent roles
Collaboration
peer discussion
multi-department (radiologist + expert) coordination
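The three-store memory listed above can be approximated in a few lines for experimentation. This is a sketch under stated assumptions: the summary says retrieval uses Chromadb with embeddings, but the code below substitutes a bag-of-words cosine similarity from the standard library so it runs without dependencies. The `Store` class, the `embed` function, and the stored entries are all hypothetical, and what each store actually holds in MEDCO is not specified here.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: token counts (a real system would use a vector model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Store:
    """Key-value store retrieved by similarity between the query and each key."""

    def __init__(self) -> None:
        self.items = []  # list of (key, value, key_vector)

    def add(self, key: str, value: str) -> None:
        self.items.append((key, value, embed(key)))

    def query(self, text: str, k: int = 1):
        q = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(key, value) for key, value, _ in ranked[:k]]


# ASSUMPTION: invented entries showing one possible use of each store.
memory = {"case": Store(), "symptom": Store(), "disease": Store()}
memory["symptom"].add("sudden unilateral weakness and slurred speech",
                      "consider stroke; ask about onset time")
memory["symptom"].add("fever with productive cough",
                      "consider pneumonia; ask about chest pain")
memory["disease"].add("ischemic stroke", "confirm with brain imaging")

print(memory["symptom"].query("patient reports slurred speech and weakness on one side"))
```

Swapping `embed` and `Store.query` for a vector database call would keep the same interface while giving semantic rather than lexical matching.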

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Students in experiments are LLMs, not real medical students, so human learning impact is unknown

Limited multi-modal validation (only 16 Neurology cases with collected images)

When Not To Use

High-stakes clinical decision-making without human clinician oversight

Regulated settings requiring certified medical training records

Failure Modes

LLM hallucinations in diagnosis or image interpretation

Memory retrieval producing irrelevant/over-extracted entities (false positives)

Core Entities

Models

GPT-3.5
GPT-4o-mini
Claude3.5-Sonnet

Metrics

HDE (1-4 expert rating avg)
SEMA: precision/recall/F1 on ICD-10 entity matching
Accuracy

Datasets

MVME (Chinese medical records, 506 cases)

Benchmarks

HDE
SEMA
CASCADE