DrugPilot: LLM agent with a key-value memory pool for reliable drug-discovery tool calling

Overview

Decision Snapshot: Needs Validation

The paper provides clear ablations and a custom benchmark showing large gains in tool-calling accuracy. Evidence is strong on synthetic and case-study tasks but limited for large-scale, real-world wet-lab pipelines.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DrugPilot cuts manual tool switching and context failures by structuring inputs as key-value parameters, improving automation accuracy and runtime for multi-step drug workflows.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

DrugPilot is an LLM-based agent designed for end-to-end drug discovery workflows. It introduces a parameterized memory pool (PMP) that stores large multimodal drug data as key-value pairs, and a feedback-focus (Fe-Fo) mechanism that checks and corrects tool calls. The authors release TCDD, a 2,800-sample tool-calling dataset, fine-tune LLMs, and show major gains on a function-calling benchmark: task completion rates of 98.0% (simple), 93.5% (multi-tool), and 64.0% (multi-turn). PMP removes large parameters from the model context, enabling large-batch processing and faster, more accurate multi-step tool use. Code and data links are provided.

Problem Statement

LLM agents struggle when drug discovery tasks need large multimodal inputs, precise tool calls, and multi-turn workflows. Text-only memory overloads the model context, tool selection and parameter passing become error-prone, and users without coding skills cannot chain domain tools reliably.

Main Contribution

Parameterized Memory Pool (PMP): a key-value store that removes large parameters from LLM context and supplies structured inputs to tools.

Feedback-Focus (Fe-Fo): an error-detection and feedback loop that restates tasks and provides corrective prompts for tool-calling mistakes.
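
As a rough illustration of the PMP idea, the sketch below keeps a large parameter in a key-value store and lets the tool call carry only a key reference. The class name `ParameterMemoryPool`, the `$key` reference syntax, and the `count_molecules` stand-in tool are all assumptions made for this example; they are not taken from the DrugPilot code.

```python
from typing import Any, Callable, Dict, List


class ParameterMemoryPool:
    """Hypothetical key-value pool (illustrative, not the authors' class)."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value

    def get(self, key: str) -> Any:
        return self._store[key]

    def delete(self, key: str) -> None:
        del self._store[key]

    def describe(self) -> str:
        """Summary for the prompt: key names and value types, never the values."""
        return ", ".join(
            f"{key} ({type(value).__name__})"
            for key, value in sorted(self._store.items())
        )


def resolve_arguments(pool: ParameterMemoryPool, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Swap '$key' references (an assumed convention) for the stored values."""
    resolved: Dict[str, Any] = {}
    for name, value in arguments.items():
        if isinstance(value, str) and value.startswith("$"):
            resolved[name] = pool.get(value[1:])
        else:
            resolved[name] = value
    return resolved


def count_molecules(smiles_list: List[str]) -> int:
    """Stand-in for a real drug-discovery tool."""
    return len(smiles_list)


pool = ParameterMemoryPool()
pool.put("smiles_list", ["CCO", "c1ccccc1", "CC(=O)O"])

# The model only ever sees pool.describe() and emits a call that names a key.
tool_call = {"name": "count_molecules", "arguments": {"smiles_list": "$smiles_list"}}
tools: Dict[str, Callable[..., Any]] = {"count_molecules": count_molecules}

result = tools[tool_call["name"]](**resolve_arguments(pool, tool_call["arguments"]))
print(pool.describe(), "->", result)
```

The design point is that the values never pass through the model: only key names and types enter the prompt, and the dispatcher resolves references just before the tool runs.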

Key Findings

High task-completion on TCDD tool-calling benchmark.

Numbers: Task completion of 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)

Practical Use: Expect far fewer manual corrections when automating single- and multi-tool drug workflows on the evaluated benchmarks.

Evidence Ref: Abstract; Results; Table 1

Large improvements over a SOTA agent (ReAct) on the benchmark.

Numbers: Gains vs ReAct of +13.2%, +66.1%, +80.3% (simple / multi-tool / multi-turn)

Practical Use: Switching to DrugPilot-style PMP and Fe-Fo can substantially raise end-to-end tool-calling success vs common agent patterns.

Evidence Ref: Abstract; Results

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Task completion rate (TCDD)	98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)	ReAct	+13.2%, +66.1%, +80.3% vs ReAct	TCDD test set (3 categories, 100 queries each)	Abstract; Results	Table 1; Fig.3
Accuracy	Acc.F 95.0%, Acc.P 93.7%	without SFT/Fe-Fo	Tool-selection +28.7%, Parameter-extraction +44.9%	Ablation on Llama3.1-8B	Ablation study; Fig.3d	Fig.3d
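
The table does not say whether the deltas vs ReAct are relative improvements or absolute percentage points. The multi-turn figure cannot be absolute, since 64.0 minus 80.3 would give a negative baseline, so one plausible reading is that all three are relative. The short calculation below shows what ReAct scores that reading would imply. This is an assumption for illustration only; the implied baselines are not reported in this summary.

```python
# Reported DrugPilot task completion rates (percent) and gains vs ReAct (percent).
drugpilot = {"simple": 98.0, "multi-tool": 93.5, "multi-turn": 64.0}
gain_vs_react = {"simple": 13.2, "multi-tool": 66.1, "multi-turn": 80.3}


def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Baseline b such that score = b * (1 + gain/100). Assumes a RELATIVE gain."""
    return score / (1.0 + relative_gain_pct / 100.0)


implied = {
    category: implied_baseline(drugpilot[category], gain_vs_react[category])
    for category in drugpilot
}

for category, baseline in implied.items():
    print(f"{category}: implied ReAct completion rate ~ {baseline:.1f}%")
```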

What To Try In 7 Days

Prototype a key-value memory pool for your tool inputs to avoid context bloat.

Fine-tune an LLM on a small tool-calling instruction set (LoRA) to reduce parameter errors.

Add a lightweight feedback checker that validates tool names/params and prompts corrective output when errors occur.
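
The third item can be prototyped without any model at all. The sketch below is a minimal feedback checker in the spirit of Fe-Fo: it validates a JSON tool call against a schema table and, on failure, restates the task together with a corrective message. The tool names, parameter names, and the `ask_model` callable are hypothetical placeholders, and the prompt wording is an assumption rather than the authors' template.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Hypothetical tool registry: tool name -> required parameter names.
TOOL_SCHEMAS: Dict[str, List[str]] = {
    "predict_property": ["smiles_key", "property_name"],
    "predict_affinity": ["drug_key", "target_key"],
}


def check_tool_call(raw_output: str) -> Optional[str]:
    """Return None when the call is valid, otherwise a corrective message."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return f"Output was not valid JSON ({exc.msg}). Reply with one JSON object."
    if not isinstance(call, dict):
        return "Output must be a JSON object with 'name' and 'arguments'."
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return f"Unknown tool {name!r}. Choose one of: {sorted(TOOL_SCHEMAS)}."
    arguments = call.get("arguments")
    if not isinstance(arguments, dict):
        return "'arguments' must be a JSON object."
    required = TOOL_SCHEMAS[name]
    missing = [p for p in required if p not in arguments]
    unexpected = [p for p in arguments if p not in required]
    if missing or unexpected:
        return (f"Tool {name!r} takes parameters {required}; "
                f"missing: {missing}, unexpected: {unexpected}.")
    return None


def call_with_feedback(ask_model: Callable[[str], str], task: str,
                       max_rounds: int = 3) -> Dict[str, Any]:
    """Query the model, check the call, and re-prompt with feedback on errors."""
    prompt = task
    for _ in range(max_rounds):
        raw = ask_model(prompt)
        feedback = check_tool_call(raw)
        if feedback is None:
            return json.loads(raw)
        # Restate the task and focus the model on the specific mistake.
        prompt = f"Task: {task}\nPrevious output: {raw}\nProblem: {feedback}"
    raise ValueError(f"No valid tool call after {max_rounds} rounds")


# Scripted stand-in for an LLM: the first reply has a wrong tool name, the second is valid.
replies = iter([
    '{"name": "predict_prop", "arguments": {}}',
    '{"name": "predict_property", "arguments": '
    '{"smiles_key": "smiles_list", "property_name": "logP"}}',
])
call = call_with_feedback(lambda prompt: next(replies),
                          "Predict logP for the stored molecules.")
print(call)
```

Replacing the scripted lambda with a real model client is the only change needed to try this against a live endpoint.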

Agent Features

Memory

parameterized memory pool (key-value store); supports CRUD and large-batch retrieval

Planning

autonomous multi-stage planning, workflow orchestration

Tool Use

function calling (JSON tool calls), tool selection and parameter passing, tool invocation verification via Fe-Fo

Frameworks

LoRA, Ollama deployment, Fe-Fo feedback-focus

Is Agentic

Yes

Architectures

LLM-based agent, parameterized memory pool (PMP)

Collaboration

human-in-the-loop parameter editing, integration with external models (AI model zoo)

Optimization Features

Token Efficiency

PMP moves large parameters out of LLM context to save tokens
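
To get a feel for the scale of the saving, the snippet below compares the size of a tool call that inlines a batch of SMILES strings with one that passes only a key. Character counts are used as a crude proxy for tokens, and the batch size, tool name, and `$key` reference syntax are assumptions chosen for illustration.

```python
import json

# A batch of 1,000 molecules (aspirin's SMILES repeated, purely as filler data).
smiles_batch = ["CC(=O)OC1=CC=CC=C1C(=O)O"] * 1000

# Text-only memory: the whole batch travels through the model context.
inline_call = json.dumps(
    {"name": "predict_property", "arguments": {"smiles": smiles_batch}}
)

# Key-value memory: the call carries a reference; the batch stays in the pool.
keyed_call = json.dumps(
    {"name": "predict_property", "arguments": {"smiles": "$smiles_batch"}}
)

print(f"inline call: {len(inline_call)} characters")
print(f"keyed call:  {len(keyed_call)} characters")
```

The keyed call also stays the same size as the batch grows, which is what makes large-batch processing practical.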

Infra Optimization

Ollama deployment for inference-stage models

Model Optimization

LoRA

System Optimization

Fe-Fo reduces retries by detecting and fixing format/parameter errors

Training Optimization

Accuracy

Inference Optimization

Reduced context length via PMP leading to faster inference

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/wzn99/DrugPilot

Data URLs

https://drive.google.com/file/d/1JthOkIAzuuaajZhgH03e9TfM9KwBHmni/view?usp=sharing

Risks & Boundaries

Limitations

TCDD is a synthetic instruction dataset; generalization to messy real-world dialogs is not fully proven.

Case studies are limited to in-silico tasks (GDSCv2, BACE); wet-lab validation is absent.

When Not To Use

For regulated clinical decisions that need audited provenance and wet-lab validation.

If you cannot fine-tune models or cannot host inference (no Ollama/compute).

Failure Modes

If PMP keys are missing or misnamed, the agent may select wrong parameters.

LLMs can still hallucinate function names or parameter formats if the memory prompt is ignored.
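
One inexpensive guard against the first failure mode is to check every key a tool call references before the tool runs, and to suggest near matches when a key is misnamed. The sketch below uses Python's standard `difflib.get_close_matches`. The key names and the helper `find_missing_keys` are hypothetical, and this check is not described in the source.

```python
import difflib
from typing import Dict, List


def find_missing_keys(pool_keys: List[str], requested: List[str]) -> Dict[str, List[str]]:
    """Map each requested key that is absent from the pool to close-match suggestions."""
    report: Dict[str, List[str]] = {}
    for key in requested:
        if key not in pool_keys:
            # An empty suggestion list means no stored key looks similar.
            report[key] = difflib.get_close_matches(key, pool_keys, n=3, cutoff=0.6)
    return report


pool_keys = ["smiles_list", "target_sequence", "cell_line_ids"]
report = find_missing_keys(pool_keys, ["smile_list", "target_sequence", "dose"])

for key, suggestions in report.items():
    hint = f"did you mean {suggestions[0]!r}?" if suggestions else "no similar key stored"
    print(f"missing key {key!r}: {hint}")
```

The report can be fed straight back to the model as corrective feedback instead of letting the tool run on a wrong or empty parameter.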

Core Entities

Models

Llama3.1-8B, Llama3-8B, Mistral-NeMo, Gemma2, Qwen2, DeepSeek-LLM-7B, DeepSeek-R1, ChatGPT-4o

Metrics

Accuracy, task completion rate, latency (seconds)

Datasets

TCDD (tool-calling dataset, 2,800 samples), GDSCv2 (case study), BACE (case study)

Benchmarks

Berkeley Function-Calling Leaderboard (customized evaluation on TCDD)
