ClinicalAgent: a GPT-4 multi-agent system that uses external databases to predict clinical trial outcomes

April 23, 2024 · 6 min read

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

3

Authors

Ling Yue, Sixue Xing, Jintai Chen, Tianfan Fu

Why It Matters For Business

ClinicalAgent shows a practical way to combine LLM reasoning with domain databases to flag risky trials and estimate enrollment difficulty, which could speed early-stage decisions. The results should be validated on larger datasets before any clinical use.

Summary TLDR

ClinicalAgent is a system that turns GPT-4 into a team of specialist agents (planning, efficacy, safety, enrollment) that call external databases (DrugBank, Hetionet, ClinicalTrials.gov) and predictive models to assess clinical trials. On a small benchmark (40 train / 40 test samples) it reached ROC-AUC 0.8347 and PR-AUC 0.7908, improving PR-AUC by 0.3326 over direct GPT prompting. The design shows how function-calling and stepwise reasoning (Least-to-Most, ReAct) can make LLM outputs more actionable, but the evaluation is small and depends on closed-source LLM APIs.

Problem Statement

Make LLMs useful for clinical trial tasks by combining GPT-4, stepwise reasoning, external biomedical databases, and specialist predictive models so the system can predict trial outcomes, enrollment difficulty, safety, and efficacy with explainable steps.

Main Contribution

ClinicalAgent: a multi-agent framework that delegates trial tasks to specialist agents and aggregates their outputs for explainable decisions.

Integration of tool calls (DrugBank, Hetionet, ClinicalTrials.gov) and small predictive models (enrollment, drug/disease risk) into LLM reasoning.

Empirical result: on the paper's small evaluation, ClinicalAgent achieved PR-AUC 0.7908 (+0.3326 vs standard GPT prompting) and ROC-AUC 0.8347, showing improved LLM performance on the tested benchmark.

Key Findings

ClinicalAgent raised precision-recall performance over direct GPT prompting.

Numbers: PR-AUC 0.7908 (+0.3326 vs GPT-4 prompt)

ClinicalAgent matched or exceeded classical ML on ROC-AUC in this test.

Numbers: ROC-AUC 0.8347 (GBDT: 0.8)

Few-shot reasoning improved ranking and PR performance within ClinicalAgent.

Numbers: PR-AUC 0.7908 (with few-shot) vs 0.6793 (without)

Evaluation used a very small sample.

Numbers: 40 training / 40 test samples
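
To make the small-sample caveat concrete, the following minimal sketch computes a 95% Wilson score interval for the reported accuracy. It assumes the accuracy of 0.70 was measured on all 40 test samples (that is, 28 correct predictions), which this summary does not state explicitly; the function name `wilson_interval` is illustrative and not from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre - half, centre + half


# Assumption: accuracy 0.70 on 40 test samples means 28 correct predictions.
low, high = wilson_interval(28, 40)
print(f"95% interval for accuracy: [{low:.3f}, {high:.3f}]")
```

Under that assumption the interval runs from roughly 0.55 to 0.82, which is wide enough to contain the HAtten accuracy of 0.75, so the accuracy gap between the two systems is hard to interpret at this sample size.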

Results

PR-AUC

Value: 0.7908

Baseline: GPT-4 standard prompt, 0.4582

ROC-AUC

Value: 0.8347

Baseline: GBDT, 0.8

Accuracy

Value: 0.70

Baseline: HAtten, 0.75

Who Should Care

What To Try In 7 Days

Prototype a simple agent pipeline: planning + one specialist (enrollment) using GPT-4 function calls.

Hook a single database (DrugBank or ClinicalTrials.gov) via a function-call wrapper and test retrieval accuracy.

Add a few-shot decomposition prompt and compare PR-AUC vs direct GPT prompts on a small held-out set.
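
For the comparison step, a dependency-free sketch of the two ranking metrics might look like the code below. It approximates PR-AUC by average precision, which may differ from the exact PR-AUC computation used in the paper, and it does not give tied scores any special treatment in the PR calculation. The labels and scores are toy values invented for illustration, not results from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k taken at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Toy held-out set (made-up values): 1 = trial succeeded, 0 = trial failed.
labels = [1, 0, 1, 0]
direct_scores = [0.9, 0.8, 0.7, 0.1]   # stand-in for direct prompting
agent_scores = [0.9, 0.2, 0.8, 0.1]    # stand-in for the agent pipeline

for name, scores in [("direct", direct_scores), ("agent", agent_scores)]:
    print(f"{name}: ROC-AUC={roc_auc(labels, scores):.4f} "
          f"PR-AUC={average_precision(labels, scores):.4f}")
# direct: ROC-AUC=0.7500 PR-AUC=0.8333
# agent: ROC-AUC=1.0000 PR-AUC=1.0000
```

Scoring both systems with the same functions on the same held-out labels keeps the comparison like-for-like. With a held-out set this small, differences of a few points should be read cautiously.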

Agent Features

Memory

  • indexing database outputs

Planning

  • problem decomposition
  • few-shot planning
  • Least-to-Most reasoning
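
The Least-to-Most idea (split a hard question into easier sub-questions, answer them in order, and carry earlier answers into later prompts) can be sketched as below. This is a minimal illustration, not the paper's prompts: `decompose` and `ask` are hypothetical callables standing in for LLM calls, and the sub-questions are placeholders chosen to mirror the specialist roles listed in this summary.

```python
from typing import Callable, List


def least_to_most(
    question: str,
    decompose: Callable[[str], List[str]],
    ask: Callable[[str], str],
) -> str:
    """Answer sub-questions in order, feeding earlier answers into later prompts."""
    solved: List[str] = []
    for sub in decompose(question):
        prompt = "\n".join(solved + [f"Q: {sub}", "A:"])
        solved.append(f"Q: {sub}\nA: {ask(prompt)}")
    # The original question is asked last, with every solved sub-question as context.
    return ask("\n".join(solved + [f"Q: {question}", "A:"]))


# Stubs standing in for LLM calls (hypothetical; no real API is used here).
prompts_seen: List[str] = []


def fake_decompose(question: str) -> List[str]:
    return [
        "Is the drug likely to be effective for the disease?",
        "Is the drug likely to be safe?",
        "Is enrollment likely to be difficult?",
    ]


def fake_ask(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"stub answer {len(prompts_seen)}"


final = least_to_most("Will the trial succeed?", fake_decompose, fake_ask)
print(final)              # stub answer 4
print(len(prompts_seen))  # 4
```

In a few-shot setup, worked decompositions would be prepended to the prompt that `decompose` sends, so the model sees examples of how to split a trial question before handling a new one.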

Tool Use

  • function calling to databases
  • knowledge graph retrieval
  • external predictive models
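
One way to wire tool use is a small registry that maps tool names to Python functions and executes calls the model requests as JSON. The sketch below assumes a request shape of `{"name": ..., "arguments": {...}}`; that shape, the `register_tool` and `dispatch` helpers, and the placeholder record are all illustrative assumptions. It does not call DrugBank, Hetionet, or any LLM API.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def register_tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that exposes a function to the model under the given name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return decorator


# Placeholder record; a real wrapper would query an actual database here.
_PLACEHOLDER_DRUGS = {"example-drug": {"note": "placeholder record"}}


@register_tool("lookup_drug")
def lookup_drug(name: str) -> Dict[str, Any]:
    record = _PLACEHOLDER_DRUGS.get(name.lower())
    return {"found": record is not None, "record": record}


def dispatch(call_json: str) -> str:
    """Run one model-requested call and return a JSON string for the model to read."""
    try:
        call = json.loads(call_json)
        fn = TOOLS[call["name"]]
        result = fn(**call.get("arguments", {}))
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Malformed JSON, unknown tool, or wrong arguments: report instead of crashing.
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return json.dumps({"result": result})


print(dispatch('{"name": "lookup_drug", "arguments": {"name": "example-drug"}}'))
print(dispatch('{"name": "no_such_tool", "arguments": {}}'))
```

Returning errors as structured JSON rather than raising lets the calling agent see that a lookup failed and decide whether to retry or abstain.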

Frameworks

  • ReAct
  • Least-to-Most

Is Agentic

true

Architectures

  • multi-agent
  • specialist agents (planning, efficacy, safety, enrollment)
  • hierarchical transformer (enrollment model)

Collaboration

  • agent coordination via a central Planning/Reasoning agent
  • role-based task assignment
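
The coordination pattern can be sketched as a central planning agent that assigns a trial to role-specific specialists and combines their findings. Everything here is a simplified assumption: the specialists are stubs returning placeholder risk values, the 0-to-1 risk scale is hypothetical, and the findings are averaged, whereas the system described in this summary uses LLM reasoning to aggregate agent outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    role: str
    risk: float      # hypothetical scale: 0.0 = low risk, 1.0 = high risk
    rationale: str


Agent = Callable[[Dict[str, str]], Finding]


# Stub specialists; real ones would call an LLM, databases, and predictive models.
def efficacy_agent(trial: Dict[str, str]) -> Finding:
    return Finding("efficacy", 0.2, "placeholder rationale")


def safety_agent(trial: Dict[str, str]) -> Finding:
    return Finding("safety", 0.4, "placeholder rationale")


def enrollment_agent(trial: Dict[str, str]) -> Finding:
    return Finding("enrollment", 0.6, "placeholder rationale")


class PlanningAgent:
    def __init__(self, specialists: Dict[str, Agent]) -> None:
        self.specialists = specialists

    def plan(self, trial: Dict[str, str]) -> List[str]:
        # A real planner would decompose the task with an LLM; here every role is used.
        return list(self.specialists)

    def assess(self, trial: Dict[str, str]) -> Dict[str, object]:
        findings = [self.specialists[role](trial) for role in self.plan(trial)]
        mean_risk = sum(f.risk for f in findings) / len(findings)
        return {
            "failure_risk": mean_risk,
            "trace": [(f.role, f.rationale) for f in findings],
        }


planner = PlanningAgent(
    {"efficacy": efficacy_agent, "safety": safety_agent, "enrollment": enrollment_agent}
)
report = planner.assess({"drug": "example-drug", "disease": "example-disease"})
print(report["trace"])
```

Keeping the per-role trace alongside the combined score is what makes the final decision inspectable step by step.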

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation used only 40 training and 40 test samples, limiting statistical confidence.
  • Relies on closed-source GPT-4 API; behaviour and costs depend on external provider.
  • Some key baselines (e.g., HAtten) were not integrated as external tools in this study.
  • LLM-generated data are used but not fully validated or released in processed form.

When Not To Use

  • For high-stakes regulatory decisions without human clinical review.
  • When patient-data privacy requirements forbid external API calls.
  • As a sole source of truth when large labeled datasets and validated classical models are available.

Failure Modes

  • Incorrect or missing database lookups leading to wrong conclusions.
  • LLM hallucinations despite tool calls, especially for rare drugs/diseases.
  • Overfitting to the small evaluation sample; results may not generalize.
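
One hedge against the first two failure modes is to have the pipeline abstain when a required lookup is missing, rather than letting the model fill the gap from memory. The sketch below is an illustrative guard, not something described in the paper; the helper names and the lookup keys are hypothetical.

```python
from typing import Dict, List, Optional


def missing_evidence(lookups: Dict[str, Optional[dict]], required: List[str]) -> List[str]:
    """Return the required lookup keys that are absent or came back empty."""
    return [key for key in required if not lookups.get(key)]


def guarded_verdict(
    lookups: Dict[str, Optional[dict]], required: List[str], verdict: str
) -> str:
    """Pass the verdict through only when every required lookup returned data."""
    gaps = missing_evidence(lookups, required)
    if gaps:
        return "abstain: missing " + ", ".join(gaps)
    return verdict


# Hypothetical lookup results: the disease lookup returned nothing.
lookups = {"drug": {"id": "placeholder"}, "disease": None}
print(guarded_verdict(lookups, ["drug", "disease"], "likely to fail"))
# abstain: missing disease
```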

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • BioBERT
  • GBDT (LightGBM)
  • Hierarchical transformer enrollment model
  • HAtten

Metrics

  • ROC-AUC
  • PR-AUC
  • Accuracy
  • Precision
  • Recall
  • F1

Datasets

  • ClinicalTrials.gov
  • Clinical trial outcome prediction benchmark (from refs [5,6])