DrugPilot: LLM agent with a key-value memory pool for reliable drug-discovery tool calling

May 20, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides clear ablations and a custom benchmark showing large tool-calling gains. The evidence is strong on synthetic and case-study tasks but limited for large-scale, real-world wet-lab pipelines.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DrugPilot cuts manual tool switching and context failures by structuring inputs as key-value parameters, improving automation accuracy and runtime for multi-step drug workflows.

Summary TLDR

DrugPilot is an LLM-based agent designed for end-to-end drug discovery workflows. It introduces a parameterized memory pool (PMP) that stores large multimodal drug data as key-value pairs, and a feedback-focus (Fe-Fo) mechanism that checks and corrects tool calls. The authors release TCDD, a 2,800-sample tool-calling dataset, fine-tune LLMs, and show major gains on a function-calling benchmark: task completion rates of 98.0% (simple), 93.5% (multi-tool), and 64.0% (multi-turn). PMP removes large parameters from the model context, enabling large-batch processing and faster, more accurate multi-step tool use. Code and data links are provided.

Problem Statement

LLM agents struggle when drug discovery tasks need large multimodal inputs, precise tool calls, and multi-turn workflows. Text-only memory overloads the model context, tool selection and parameter passing become error-prone, and users without coding skills cannot chain domain tools reliably.

Main Contribution

Parameterized Memory Pool (PMP): a key-value store that removes large parameters from LLM context and supplies structured inputs to tools.

Feedback-Focus (Fe-Fo): an error-detection and feedback loop that restates tasks and provides corrective prompts for tool-calling mistakes.
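The key-value idea behind PMP can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the `$key` reference convention, and the `count_molecules` tool are all hypothetical, chosen only to show how large inputs can stay outside the model context while the tool still receives them.

```python
from typing import Any, Dict, List


class ParameterMemoryPool:
    """Toy key-value store: large values live here, not in the prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def create(self, key: str, value: Any) -> None:
        if key in self._store:
            raise KeyError(f"key already exists: {key}")
        self._store[key] = value

    def read(self, key: str) -> Any:
        return self._store[key]

    def update(self, key: str, value: Any) -> None:
        if key not in self._store:
            raise KeyError(f"unknown key: {key}")
        self._store[key] = value

    def delete(self, key: str) -> None:
        del self._store[key]

    def keys(self) -> List[str]:
        return sorted(self._store)

    def resolve(self, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Swap "$key" references (an assumed convention) for stored values."""
        resolved: Dict[str, Any] = {}
        for name, value in arguments.items():
            if isinstance(value, str) and value.startswith("$"):
                resolved[name] = self.read(value[1:])
            else:
                resolved[name] = value
        return resolved


def count_molecules(smiles_list: List[str], min_length: int = 0) -> int:
    """Hypothetical stand-in for a domain tool."""
    return sum(1 for s in smiles_list if len(s) >= min_length)


pool = ParameterMemoryPool()
pool.create("smiles_batch", ["CCO", "c1ccccc1", "CC(=O)O"])

# The model emits only a short key reference, never the payload itself.
tool_call = {
    "name": "count_molecules",
    "arguments": {"smiles_list": "$smiles_batch", "min_length": 4},
}

result = count_molecules(**pool.resolve(tool_call["arguments"]))
print(result)
```

In this sketch the batch could hold thousands of entries without changing the size of the tool call the model has to produce.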

Key Findings

High task-completion on TCDD tool-calling benchmark.

Numbers: Task completion: 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)

Practical Use: Expect far fewer manual corrections when automating single- and multi-tool drug workflows on the evaluated benchmarks.

Evidence Ref: Abstract; Results; Table 1

Large improvements over a SOTA agent (ReAct) on the benchmark.

Numbers: Gains vs ReAct: +13.2%, +66.1%, +80.3% (simple/multi-tool/multi-turn)

Practical Use: Switching to DrugPilot-style PMP and Fe-Fo can substantially raise end-to-end tool-calling success vs common agent patterns.

Evidence Ref: Abstract; Results

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Task completion rate (TCDD) | 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn) | ReAct | +13.2%, +66.1%, +80.3% vs ReAct | TCDD test set (3 categories, 100 queries each) | Abstract; Results | Table 1; Fig. 3
Accuracy | Acc.F 95.0%, Acc.P 93.7% | without SFT/Fe-Fo | Tool selection +28.7%, parameter extraction +44.9% | Ablation on Llama3.1-8B | Ablation study; Fig. 3d | Fig. 3d

What To Try In 7 Days

Prototype a key-value memory pool for your tool inputs to avoid context bloat.

Fine-tune an LLM on a small tool-calling instruction set (LoRA) to reduce parameter errors.

Add a lightweight feedback checker that validates tool names/params and prompts corrective output when errors occur.
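The feedback checker from the last step could start as something like the sketch below. It assumes tool calls arrive as JSON objects with `name` and `arguments` fields; the tool registry, tool names, and prompt wording are hypothetical and are not taken from the paper's Fe-Fo implementation.

```python
import json
from typing import Dict, List, Optional

# Hypothetical registry: tool name -> required parameter names.
TOOL_SPECS: Dict[str, List[str]] = {
    "predict_property": ["smiles", "property_name"],
    "predict_drug_response": ["drug_id", "cell_line_id"],
}


def check_tool_call(raw_output: str, specs: Dict[str, List[str]]) -> List[str]:
    """Return the problems found in a JSON tool call (empty list if valid)."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in specs:
        return [f"unknown tool {name!r}; choose one of {sorted(specs)}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["'arguments' must be a JSON object"]

    problems: List[str] = []
    missing = [p for p in specs[name] if p not in args]
    extra = [p for p in args if p not in specs[name]]
    if missing:
        problems.append(f"missing parameters: {missing}")
    if extra:
        problems.append(f"unexpected parameters: {extra}")
    return problems


def corrective_prompt(task: str, problems: List[str]) -> Optional[str]:
    """Restate the task and list the errors; None means nothing to fix."""
    if not problems:
        return None
    lines = [f"Task: {task}", "Your previous tool call had these problems:"]
    lines += [f"- {p}" for p in problems]
    lines.append("Reply with a corrected JSON tool call only.")
    return "\n".join(lines)


bad_call = ('{"name": "predict_property", '
            '"arguments": {"smile": "CCO", "property_name": "logP"}}')
problems = check_tool_call(bad_call, TOOL_SPECS)
print(corrective_prompt("Predict logP for ethanol.", problems))
```

Restating the task in the corrective prompt mirrors the description of Fe-Fo above; how many retries to allow before escalating to a human is a separate design decision this sketch leaves open.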

Agent Features

Memory
parameterized memory pool (key-value store); supports CRUD and large-batch retrieval
Planning
autonomous multi-stage planning; workflow orchestration
Tool Use
function calling (JSON tool calls); tool selection and parameter passing; tool invocation verification via Fe-Fo
Frameworks
LoRA; Ollama deployment; Fe-Fo feedback-focus
Is Agentic

Yes

Architectures
LLM-based agent; parameterized memory pool (PMP)
Collaboration
human-in-the-loop parameter editing; integration with external models (AI model zoo)

Optimization Features

Token Efficiency
PMP moves large parameters out of LLM context to save tokens
Infra Optimization
Ollama deployment for inference-stage models
Model Optimization
LoRA
System Optimization
Fe-Fo reduces retries by detecting and fixing format/parameter errors
Training Optimization
Accuracy
Inference Optimization
Reduced context length via PMP leading to faster inference
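
The context-length effect of PMP can be made concrete with a rough sketch. It compares a prompt that inlines a batch of inputs with one that carries only a key reference. Character count is used here as a crude proxy for tokens, and the prompt layout, the `$smiles_batch` reference, and the placeholder batch are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def build_prompt(task: str, arguments: Dict[str, Any]) -> str:
    """Assumed prompt layout: task text followed by JSON-encoded arguments."""
    return f"{task}\nArguments: {json.dumps(arguments)}"


# Placeholder batch: one SMILES string repeated 1,000 times.
smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O"] * 1000

inline_prompt = build_prompt("Predict solubility.", {"smiles_list": smiles})
keyed_prompt = build_prompt("Predict solubility.", {"smiles_list": "$smiles_batch"})

# The inline prompt grows with batch size; the keyed prompt does not.
print(len(inline_prompt), len(keyed_prompt))
```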

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

TCDD is a synthetic instruction dataset; generalization to messy real-world dialogs is not fully proven.

Case studies are limited to in-silico tasks (GDSCv2, BACE); wet-lab validation is absent.

When Not To Use

For regulated clinical decisions that need audited provenance and wet-lab validation.

If you cannot fine-tune models or cannot host inference (no Ollama/compute).

Failure Modes

If PMP keys are missing or misnamed, the agent may select wrong parameters.

LLMs can still hallucinate function names or parameter formats if the memory prompt is ignored.
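
One possible guard against the missing or misnamed key failure mode is to fail loudly and suggest near matches instead of silently falling back. The sketch below uses Python's standard `difflib`; the store contents and function name are hypothetical, and the paper does not say DrugPilot handles the case this way.

```python
import difflib
from typing import Any, Dict


def lookup_with_suggestions(store: Dict[str, Any], key: str) -> Any:
    """Return store[key], or raise KeyError naming close matches if any."""
    if key in store:
        return store[key]
    close = difflib.get_close_matches(key, list(store), n=3, cutoff=0.6)
    hint = f"; did you mean {close}?" if close else ""
    raise KeyError(f"no entry named {key!r}{hint}")


# Hypothetical pool contents.
store = {"smiles_batch": ["CCO"], "cell_lines": ["A549"]}

try:
    lookup_with_suggestions(store, "smile_batch")  # misspelled key
except KeyError as exc:
    message = exc.args[0]
    print(message)
```

The resulting message can be fed back to the model as a corrective prompt, or shown to a user editing parameters by hand.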

Core Entities

Models

Llama3.1-8B, Llama3-8B, Mistral-NeMo, Gemma2, Qwen2, DeepSeek-LLM-7B, DeepSeek-R1, ChatGPT-4o

Metrics

Accuracy, task completion rate, latency (seconds)

Datasets

TCDD (tool-calling dataset, 2,800 samples), GDSCv2 (case study), BACE (case study)

Benchmarks

Berkeley function-calling leaderboard (customized evaluation on TCDD)