MEDCO: a multi-agent copilot that trains medical students via patient, radiologist, and expert role-play

Overview

Decision Snapshot: Needs Validation

MEDCO is a working prototype validated on simulated LLM students and a 506-case dataset; promising signals exist but no human-student trials or public code are provided.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Hao Wei, Jianing Qiu, Haibao Yu, Wu Yuan

Why It Matters For Business

MEDCO shows multi-role simulation plus a small retrieval memory can lift weaker models to near-strong performance in simulated medical training, which can differentiate education products and reduce instructor bandwidth.

Who Should Care

Product Manager, ML Engineer, Data Scientist

Summary TLDR

MEDCO is a multi-agent training system that uses three LLM roles (patient, radiologist, medical expert) plus a student agent and a small retrieval memory to simulate real clinical learning. In simulated experiments on a 506-case Chinese medical dataset (MVME), MEDCO boosts diagnostic scores: the weak student model (GPT-3.5) improves its average expert-rated diagnosis score (HDE) from ~1.97 to ~2.17 after learning and to ~2.30 with peer discussion. Retrieval memory and multi-modal image interpretation further raise ICD-10 F1 and fine-grained accuracy. The results are promising but come from LLM role-play, not human students.

Problem Statement

Single-chatbot tools cannot mimic the multidisciplinary, interactive nature of real medical training. Students need practice asking the right questions, multidisciplinary feedback, and memory of past cases — capabilities missing from solitary LLM tutors. MEDCO builds a multi-agent copilot to simulate patients, specialists, and expert feedback to train students and test whether this improves diagnostic learning.

Main Contribution

Design of MEDCO: a prompt-driven multi-agent copilot with four roles (patient, student, radiologist, medical expert) and tools for image/report interpretation.

A lightweight memory design (case-store, symptom-store, disease-store) and ICD-10–based hierarchical evaluation (coarse/medium/fine).
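
The coarse/medium/fine evaluation can be pictured as prefix matching on ICD-10 codes. The sketch below is only an illustration under stated assumptions: it treats the leading letter as the coarse level (a rough stand-in for the ICD-10 chapter), the three-character category as the medium level, and the full code as the fine level. The summary does not say how MEDCO defines its levels, and the function names `normalize`, `match_level`, and `accuracy_by_level` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

LEVELS = ("coarse", "medium", "fine")


def normalize(code: str) -> str:
    """Uppercase an ICD-10 code and drop the dot, e.g. 'i63.9' -> 'I639'."""
    return code.strip().upper().replace(".", "")


def match_level(predicted: str, reference: str) -> str:
    """Return the finest assumed level at which two codes agree, or 'none'."""
    p, r = normalize(predicted), normalize(reference)
    if not p or not r:
        return "none"
    if p == r:
        return "fine"      # identical full codes
    if len(p) >= 3 and p[:3] == r[:3]:
        return "medium"    # same three-character category, e.g. I63
    if p[0] == r[0]:
        return "coarse"    # same leading letter (assumed proxy for chapter)
    return "none"


def accuracy_by_level(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of (predicted, reference) pairs agreeing at each level or finer."""
    pairs = list(pairs)
    hits = {level: 0 for level in LEVELS}
    for predicted, reference in pairs:
        level = match_level(predicted, reference)
        if level == "none":
            continue
        # A fine match also counts as a medium and a coarse match.
        for coarser in LEVELS[: LEVELS.index(level) + 1]:
            hits[coarser] += 1
    total = len(pairs)
    return {level: (hits[level] / total if total else 0.0) for level in LEVELS}


if __name__ == "__main__":
    # Made-up code pairs, for illustration only.
    demo = [("I63.9", "I63.9"), ("I63.9", "I63.0"), ("I63.9", "I10"), ("I63.9", "G40.1")]
    print(accuracy_by_level(demo))
```

Reporting accuracy at each level separately shows whether a student lands in the right disease family even when the exact code is missed.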

Key Findings

MEDCO raises average expert-rated diagnostic score (HDE) of a weak student (GPT-3.5).

Numbers: HDE avg 1.965 → 2.169 (knowledge) → 2.299 (peer discussion)

Practical Use: Use multi-role feedback and memory recall to improve the diagnostic scores of weaker models. The same setup may accelerate novice learning in training simulators, but this was not tested on human students.

Evidence Ref: Table 1

Peer discussion notably boosts entity recall and F1 in ICD-10 matching for the weak student.

Numbers: SEMA recall 17.95% → 29.72%; F1 26.01% → 36.04%

Practical Use: Adding peer-discussion workflows to training can increase the set-level disease recall and overall diagnostic agreement, but expect more extracted entities (possible false positives).

Evidence Ref: Table 2
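
To make the recall/F1 trade-off concrete, here is a minimal sketch of set-level precision, recall, and F1 over diagnosed entities. It uses the standard set-based definitions with exact string matching; the summary does not describe how SEMA matches entities or averages across cases, so treat `set_prf` as a hypothetical helper rather than the paper's metric. The codes in the example are made up.

```python
from typing import Iterable, Tuple


def set_prf(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float, float]:
    """Set-level precision, recall, and F1 with exact matching (an assumption)."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    reference = {"I63", "I10", "E11"}        # made-up gold entities
    solo = {"I63"}                           # few, safe guesses
    after_discussion = {"I63", "I10", "G40"}  # more guesses, one wrong
    print("solo:", set_prf(solo, reference))
    print("after discussion:", set_prf(after_discussion, reference))
```

In this toy case the longer list recovers more reference entities (higher recall) while admitting a false positive (lower precision), which is the pattern the practical-use note warns about.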

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HDE average (expert rating)	GPT-3.5: 1.965 → 2.169 (w/ knowledge) → 2.299 (peer discussion)	1.965 (GPT-3.5 untrained)	+0.204 (w/ knowledge) ; +0.334 (peer discussion vs baseline)	Entire test set (247 cases)	Table 1: HDE results	Table 1
SEMA recall / F1 (ICD-10 matching)	GPT-3.5 recall 17.95% → 29.72%, F1 26.01% → 36.04% (peer discussion)	17.95% recall, 26.01% F1	+11.77 pp recall; +10.03 pp F1	Entire test set	Table 2 SEMA rows	Table 2

What To Try In 7 Days

Prototype a 3-role prompt flow (student/patient/expert) with one LLM and scripted cases to test HDE-like ratings.

Add a simple vector DB (Chromadb) with 50 case summaries and test recall-based follow-up question prompts.

Run A/B: solo practice vs peer-discussion-style prompts on 50 clinical cases to measure entity recall and F1 change.
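
For the A/B comparison in the last item above, one reasonable way to read the result is a paired bootstrap on per-case scores, since both conditions run on the same cases. The sketch below is an assumption about how such an analysis could be done, not something described in the source; `paired_bootstrap` and its parameters are hypothetical, and the scores in the example are placeholders.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def paired_bootstrap(
    solo: Sequence[float],
    peer: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean per-case improvement (peer - solo) with a 95% percentile interval."""
    if len(solo) != len(peer) or not solo:
        raise ValueError("solo and peer must be non-empty and the same length")
    diffs: List[float] = [p - s for s, p in zip(solo, peer)]
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    resampled_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    low = resampled_means[int(0.025 * n_resamples)]
    high = resampled_means[int(0.975 * n_resamples) - 1]
    return mean(diffs), low, high


if __name__ == "__main__":
    # Placeholder per-case F1 scores; replace with measured values.
    solo_f1 = [0.20, 0.35, 0.10, 0.40, 0.25]
    peer_f1 = [0.30, 0.40, 0.25, 0.35, 0.45]
    print(paired_bootstrap(solo_f1, peer_f1))
```

If the interval excludes zero, the peer-discussion prompts likely helped on these cases; with only 50 cases the interval will be wide, so a small gain may not be distinguishable from noise.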

Agent Features

Memory

key-value memory (case-store, symptom-store, disease-store)

retrieval with Chromadb + embeddings
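
As a rough picture of the three-store memory, the sketch below keeps each store as a plain dictionary and ranks keys by bag-of-words cosine similarity. This is a dependency-free stand-in for the Chromadb-plus-embeddings retrieval named above, not the system's implementation; the class `StudentMemory`, its methods, and what each store holds are assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

STORES = ("case_store", "symptom_store", "disease_store")


def _bag(text: str) -> Counter:
    """Lowercased bag-of-words vector; a crude substitute for an embedding."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class StudentMemory:
    """Three key-value stores; the key/value contents are assumed, not sourced."""
    case_store: Dict[str, str] = field(default_factory=dict)
    symptom_store: Dict[str, str] = field(default_factory=dict)
    disease_store: Dict[str, str] = field(default_factory=dict)

    def _store(self, name: str) -> Dict[str, str]:
        if name not in STORES:
            raise KeyError(f"unknown store: {name}")
        return getattr(self, name)

    def add(self, store: str, key: str, value: str) -> None:
        self._store(store)[key] = value

    def recall(self, store: str, query: str, k: int = 1) -> List[Tuple[str, str]]:
        """Return up to k entries whose keys overlap with the query, best first."""
        q = _bag(query)
        scored = [(_cosine(q, _bag(key)), key, value)
                  for key, value in self._store(store).items()]
        scored = [item for item in scored if item[0] > 0.0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(key, value) for _, key, value in scored[:k]]


if __name__ == "__main__":
    memory = StudentMemory()
    memory.add("symptom_store", "sudden weakness on one side", "consider stroke")
    memory.add("symptom_store", "chronic cough with fever", "consider pneumonia")
    print(memory.recall("symptom_store", "patient reports sudden weakness"))
```

Swapping `_bag` and `_cosine` for a real embedding model and a vector database changes the quality of recall but not the shape of the interface.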

Tool Use

image interpreting tool

report VQA tool

Frameworks

prompt templates for each role

Is Agentic

Yes

Architectures

prompt-driven multi-agent roles
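
A prompt-driven role loop can be prototyped with a single model and one system prompt per role. The sketch below is a minimal illustration, assuming a generic `llm(system_prompt, message)` callable that you supply; the prompt texts, the turn order, and the `run_case` function are hypothetical and are not taken from the paper. The student never sees the case record directly, only what the other roles say.

```python
from typing import Callable, List, Tuple

# Hypothetical role prompts; real templates would be far more detailed.
ROLE_PROMPTS = {
    "student": "You are a medical student. Ask focused questions, then diagnose.",
    "patient": "You are the patient. Answer only from the case record.",
    "radiologist": "You are a radiologist. Report imaging findings from the record.",
    "expert": "You are a senior clinician. Critique the student's diagnosis.",
}

LLM = Callable[[str, str], str]  # (system_prompt, message) -> reply


def run_case(llm: LLM, case_record: str, n_questions: int = 2) -> List[Tuple[str, str]]:
    """Run one simulated encounter and return the (role, reply) transcript."""
    transcript: List[Tuple[str, str]] = []

    def say(role: str, instruction: str, sees_record: bool) -> None:
        parts = []
        if sees_record:
            parts.append(f"Case record: {case_record}")
        parts.extend(f"{who}: {text}" for who, text in transcript)
        parts.append(instruction)
        reply = llm(ROLE_PROMPTS[role], "\n".join(parts))
        transcript.append((role, reply))

    for _ in range(n_questions):
        say("student", "Ask the patient one question.", sees_record=False)
        say("patient", "Answer the student's last question.", sees_record=True)
    say("radiologist", "Summarize the imaging findings.", sees_record=True)
    say("student", "State your diagnosis with brief reasoning.", sees_record=False)
    say("expert", "Give feedback on the student's diagnosis.", sees_record=True)
    return transcript


if __name__ == "__main__":
    def echo_llm(system_prompt: str, message: str) -> str:
        # Stand-in model so the flow can be exercised without an API.
        return f"(reply to: {message.splitlines()[-1]})"

    for role, reply in run_case(echo_llm, "Made-up record for testing."):
        print(f"{role}: {reply}")
```

Keeping the model behind a plain callable makes it easy to swap providers or to replace one role with a scripted responder during testing.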

Collaboration

peer discussion

multi-department (radiologist + expert) coordination

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Students in experiments are LLMs, not real medical students, so human learning impact is unknown

Limited multi-modal validation (only 16 Neurology cases with collected images)

When Not To Use

High-stakes clinical decision-making without human clinician oversight

Regulated settings requiring certified medical training records

Failure Modes

LLM hallucinations in diagnosis or image interpretation

Memory retrieval producing irrelevant/over-extracted entities (false positives)

Core Entities

Models

GPT-3.5

GPT-4o-mini

Claude3.5-Sonnet

Metrics

HDE (1-4 expert rating avg)

SEMA: precision/recall/F1 on ICD-10 entity matching

Accuracy

Datasets

MVME (Chinese medical records, 506 cases)

Benchmarks

HDE

SEMA

CASCADE
