MEDCO: a multi-agent copilot that trains medical students via patient, radiologist, and expert role-play

August 22, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

2

Authors

Hao Wei, Jianing Qiu, Haibao Yu, Wu Yuan

Why It Matters For Business

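MEDCO shows that multi-role simulation plus a small retrieval memory can lift a weaker model in simulated medical training: after peer discussion, GPT-3.5 reaches an average HDE of 2.299, roughly the untrained Claude3.5-Sonnet baseline of 2.283. This could differentiate education products and reduce the demand on instructor time.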

Summary TLDR

MEDCO is a multi-agent training system that uses three LLM roles (patient, radiologist, medical expert) plus a student agent and a small retrieval memory to simulate real clinical learning. In simulated experiments on MVME, a Chinese medical dataset of 506 cases, MEDCO boosts diagnostic scores: for a weak model (GPT-3.5), the average expert-rated diagnosis score (HDE) rises from ~1.97 to ~2.17 after learning and to ~2.30 with peer discussion. Retrieval memory and multi-modal image interpretation further raise ICD-10 F1 and fine-grained accuracy. The results are promising, but they come from LLM role-play, not from human students.

Problem Statement

Single-chatbot tools cannot mimic the multidisciplinary, interactive nature of real medical training. Students need practice asking the right questions, multidisciplinary feedback, and memory of past cases — capabilities missing from solitary LLM tutors. MEDCO builds a multi-agent copilot to simulate patients, specialists, and expert feedback to train students and test whether this improves diagnostic learning.

Main Contribution

Design of MEDCO: a prompt-driven multi-agent copilot with four roles (patient, student, radiologist, medical expert) and tools for image/report interpretation.

A lightweight memory design (case-store, symptom-store, disease-store) and ICD-10–based hierarchical evaluation (coarse/medium/fine).

Simulated experiments showing that agentic learning, retrieval-based recall, and peer discussion improve diagnostic scores and ICD-10 matching on MVME cases.
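The coarse/medium/fine evaluation can be pictured as prefix matching on ICD-10 codes. The sketch below rests on an assumption about how the levels might be defined (first letter, three-character category, full code); this summary does not give the paper's exact level definitions, and `icd_level_match` and `hierarchical_accuracy` are hypothetical helper names.

```python
def normalize(code: str) -> str:
    """Uppercase an ICD-10 code and drop the dot, e.g. 'i63.9' -> 'I639'."""
    return code.strip().upper().replace(".", "")


# Assumed prefix lengths per level; not taken from the paper.
LEVEL_PREFIX = {"coarse": 1, "medium": 3}


def icd_level_match(predicted: str, gold: str, level: str) -> bool:
    """True if the predicted code agrees with the gold code at the given level."""
    p, g = normalize(predicted), normalize(gold)
    if level == "fine":
        return p == g  # full code must match
    n = LEVEL_PREFIX[level]
    return len(p) >= n and p[:n] == g[:n]


def hierarchical_accuracy(pairs):
    """pairs: list of (predicted, gold) codes. Returns accuracy per level."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    levels = ("coarse", "medium", "fine")
    return {
        lvl: sum(icd_level_match(p, g, lvl) for p, g in pairs) / len(pairs)
        for lvl in levels
    }


# Illustrative predictions against a single gold code (made-up data).
pairs = [
    ("I63.9", "I63.9"),  # exact match
    ("I63.5", "I63.9"),  # same category, different subcode
    ("I61.0", "I63.9"),  # same first letter only
    ("G40.9", "I63.9"),  # no agreement
]
scores = hierarchical_accuracy(pairs)
print(scores)
```

Under these assumed definitions a prediction can earn coarse credit while missing at the fine level, which is why accuracy can only stay flat or drop as the level gets finer.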

Key Findings

MEDCO raises the average expert-rated diagnostic score (HDE) of a weak student (GPT-3.5).

Numbers: HDE avg 1.965 → 2.169 (knowledge) → 2.299 (peer discussion)

Peer discussion notably boosts entity recall and F1 in ICD-10 matching for the weak student.

Numbers: SEMA recall 17.95% → 29.72%; F1 26.01% → 36.04%

Strong models also benefit: Claude3.5-Sonnet improves after MEDCO training.

Numbers: HDE avg 2.283 → 2.693 (suggestions); ICD-10 F1 31.25 → 38.70

Multi-modal input (images/reports) increases fine-grained diagnostic accuracy and ICD F1.

Numbers: GPT-3.5 F1 14.08% → 23.53%; Claude F1 29.73% → 38.89%

Results

HDE average (expert rating)

Value: GPT-3.5 1.965 → 2.169 (w/ knowledge) → 2.299 (peer discussion)

Baseline: 1.965 (GPT-3.5 untrained)

SEMA recall / F1 (ICD-10 matching)

Value: GPT-3.5 recall 17.95% → 29.72%, F1 26.01% → 36.04% (peer discussion)

Baseline: 17.95% recall, 26.01% F1

Accuracy

Value: GPT-3.5 coarse 43.72% → 46.67% (w/ knowledge); Claude3.5 baseline 48.26%

Baseline: 43.72% (GPT-3.5)

Multi-modal ICD-10 F1 (Neurology subset)

Value: GPT-3.5 F1 14.08% → 23.53% (multi-modal); Claude F1 29.73% → 38.89%

Baseline: Text-only results shown in Table 8/9
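To put the reported numbers on a common footing, the short script below computes absolute and relative gains over the stated baselines. It uses only figures quoted in this summary; the helper name `gain` is illustrative.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    delta = after - before
    return delta, delta / before * 100


# (baseline, after) pairs copied from the results above.
reported = {
    "HDE avg, GPT-3.5, w/ knowledge": (1.965, 2.169),
    "HDE avg, GPT-3.5, peer discussion": (1.965, 2.299),
    "SEMA recall %, GPT-3.5, peer discussion": (17.95, 29.72),
    "SEMA F1 %, GPT-3.5, peer discussion": (26.01, 36.04),
    "Multi-modal ICD-10 F1 %, GPT-3.5": (14.08, 23.53),
}

for name, (before, after) in reported.items():
    abs_gain, rel_gain = gain(before, after)
    print(f"{name}: {abs_gain:+.3f} absolute, {rel_gain:+.1f}% relative")
```

For rows already expressed in percent, the absolute change is in percentage points. The peer-discussion HDE gain, for example, is about 0.33 points on the rating scale, or roughly 17% over the untrained baseline.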

Who Should Care

What To Try In 7 Days

Prototype a 3-role prompt flow (student/patient/expert) with one LLM and scripted cases to test HDE-like ratings.

Add a simple vector DB (ChromaDB) with 50 case summaries and test recall-based follow-up question prompts.

Run an A/B test: solo practice vs. peer-discussion-style prompts on 50 clinical cases, measuring the change in entity recall and F1.
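A first prototype of the role-prompt flow might look like the sketch below. It assumes a single model behind a plain callable `(system_prompt, user_message) -> reply`; the role prompts, the `DIAGNOSIS:` stop convention, and the function names are hypothetical and not taken from the paper. A scripted stub stands in for a real LLM so the loop runs offline.

```python
from typing import Callable

LLM = Callable[[str, str], str]  # (system_prompt, user_message) -> reply

# Hypothetical role prompts, written for illustration only.
ROLE_PROMPTS = {
    "patient": "You are a patient. Answer only what is asked, using this case: {case}",
    "student": "You are a medical student. Ask one question, or write 'DIAGNOSIS: ...' when ready.",
    "expert": "You are a senior clinician. Rate the diagnosis from 1 to 4 against the reference: {reference}",
}


def run_case(llm: LLM, case: str, reference: str, max_turns: int = 5) -> dict:
    """Run one student/patient consultation, then ask the expert for feedback."""
    transcript = []
    for _ in range(max_turns):
        history = "\n".join(transcript) or "(start of consultation)"
        student_msg = llm(ROLE_PROMPTS["student"], history)
        transcript.append(f"Student: {student_msg}")
        if student_msg.startswith("DIAGNOSIS:"):
            break  # the student has committed to a diagnosis
        reply = llm(ROLE_PROMPTS["patient"].format(case=case), student_msg)
        transcript.append(f"Patient: {reply}")
    feedback = llm(
        ROLE_PROMPTS["expert"].format(reference=reference), "\n".join(transcript)
    )
    return {"transcript": transcript, "feedback": feedback}


def scripted_llm(system_prompt: str, user_message: str) -> str:
    """Offline stub: replace with a call to a real model."""
    if system_prompt.startswith("You are a patient"):
        return "I have had a headache and blurred vision for two days."
    if system_prompt.startswith("You are a medical student"):
        # Ask one question, then commit once the patient has answered.
        if "Patient:" in user_message:
            return "DIAGNOSIS: migraine"
        return "What symptoms do you have?"
    return "Score: 2. Consider imaging to rule out other causes."


result = run_case(
    scripted_llm,
    case="Headache and blurred vision for two days.",
    reference="migraine with aura",
)
print("\n".join(result["transcript"]))
print(result["feedback"])
```

Keeping the model behind a callable makes it easy to swap the stub for a real API client, or to give each role a different model, without touching the loop.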

Agent Features

Memory

  • key-value memory (case-store, symptom-store, disease-store)
  • retrieval with ChromaDB + embeddings
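The three stores can be sketched as plain dictionaries with a similarity lookup over past cases. This is a minimal stand-in, not the paper's implementation: it swaps the vector database and embeddings for token-overlap (Jaccard) similarity so it runs without dependencies, and the contents assumed for each store, along with the `StudentMemory` class, are hypothetical.

```python
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class StudentMemory:
    # Assumed contents of each store; the summary only names the stores.
    case_store: dict = field(default_factory=dict)     # case id -> case summary
    symptom_store: dict = field(default_factory=dict)  # symptom -> set of diseases
    disease_store: dict = field(default_factory=dict)  # disease -> list of notes

    def add_case(self, case_id, summary, symptoms, disease, note):
        """Record one finished case in all three stores."""
        self.case_store[case_id] = summary
        for symptom in symptoms:
            self.symptom_store.setdefault(symptom, set()).add(disease)
        self.disease_store.setdefault(disease, []).append(note)

    def recall_cases(self, query, k=2):
        """Return the k most similar past cases as (case_id, score) pairs."""
        scored = [(cid, jaccard(query, text)) for cid, text in self.case_store.items()]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up cases for illustration.
memory = StudentMemory()
memory.add_case(
    "c1",
    "sudden weakness on one side and slurred speech",
    ["weakness", "slurred speech"],
    "ischemic stroke",
    "Ask about onset time.",
)
memory.add_case(
    "c2",
    "chronic cough with fever and night sweats",
    ["cough", "fever"],
    "tuberculosis",
    "Ask about exposure history.",
)
top = memory.recall_cases("patient with slurred speech and weakness", k=1)
print(top)
```

In a fuller prototype, `recall_cases` is the piece one would replace with an embedding query against a vector store; the surrounding key-value structure can stay the same.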

Tool Use

  • image interpreting tool
  • report VQA tool

Frameworks

  • prompt templates for each role

Is Agentic

true

Architectures

  • prompt-driven multi-agent roles

Collaboration

  • peer discussion
  • multi-department (radiologist + expert) coordination

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Students in experiments are LLMs, not real medical students, so human learning impact is unknown
  • Limited multi-modal validation (only 16 Neurology cases with collected images)
  • Expert feedback is currently text-only; no multi-modal expert exemplars were used
  • Peer discussion increases recall but can raise false positives in ICD extraction

When Not To Use

  • High-stakes clinical decision-making without human clinician oversight
  • Regulated settings requiring certified medical training records
  • As a sole assessment for student certification

Failure Modes

  • LLM hallucinations in diagnosis or image interpretation
  • Memory retrieval producing irrelevant/over-extracted entities (false positives)
  • Bias from using the same or similar LLMs for both student and expert roles
  • Overfitting to MVME dataset idiosyncrasies

Core Entities

Models

  • GPT-3.5
  • GPT-4o-mini
  • Claude3.5-Sonnet

Metrics

  • HDE (average expert rating on a 1-4 scale)
  • SEMA: precision/recall/F1 on ICD-10 entity matching
  • Accuracy
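For readers who want to reproduce SEMA-style scores, the sketch below computes set-based precision, recall, and F1 over ICD-10 codes. It assumes exact string matching on codes, which may differ from the paper's matching procedure, and the codes are made up. Taking the union of two students' predictions is used here as a crude stand-in for peer discussion, to show how recall can rise while a false positive lowers precision.

```python
def prf(predicted: set, gold: set) -> dict:
    """Set-based precision, recall, and F1 with exact matching."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative gold diagnoses and two students' extracted codes.
gold = {"I63.9", "I10", "E11.9"}
student_a = {"I63.9"}
student_b = {"I10", "G43.9"}  # G43.9 is not in the gold set

solo = prf(student_a, gold)
merged = prf(student_a | student_b, gold)  # union as a proxy for discussion

print("solo:  ", solo)
print("merged:", merged)
```

In this toy case recall doubles from one third to two thirds while precision falls from 1.0 to two thirds, the same trade-off noted under Limitations for peer discussion.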

Datasets

  • MVME (Chinese medical records, 506 cases)

Benchmarks

  • HDE
  • SEMA
  • CASCADE