DrugPilot: LLM agent with a key-value memory pool for reliable drug-discovery tool calling

May 20, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

3

Authors

Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu

Links

Abstract / PDF

Why It Matters For Business

DrugPilot cuts manual tool switching and context failures by structuring inputs as key-value parameters, improving automation accuracy and runtime for multi-step drug workflows.

Summary TLDR

DrugPilot is an LLM-based agent designed for end-to-end drug discovery workflows. It introduces a parameterized memory pool (PMP) that stores large multimodal drug data as key-value pairs, and a feedback-focus (Fe-Fo) mechanism that checks and corrects tool calls. The authors release TCDD, a 2,800-sample tool-calling dataset, fine-tune LLMs, and show major gains on a function-calling benchmark: task completion rates of 98.0% (simple), 93.5% (multi-tool), and 64.0% (multi-turn). PMP removes large parameters from the model context, enabling large-batch processing and faster, more accurate multi-step tool use. Code and data links are provided.

Problem Statement

LLM agents struggle when drug discovery tasks need large multimodal inputs, precise tool calls, and multi-turn workflows. Text-only memory overloads the model context, tool selection and parameter passing become error-prone, and users without coding skills cannot chain domain tools reliably.

Main Contribution

Parameterized Memory Pool (PMP): a key-value store that removes large parameters from LLM context and supplies structured inputs to tools.

Feedback-Focus (Fe-Fo): an error-detection and feedback loop that restates tasks and provides corrective prompts for tool-calling mistakes.

TCDD dataset: 2,800 annotated instruction samples covering 8 core drug discovery tools for fine-tuning and evaluation.

Benchmarks and results: DrugPilot shows higher function-selection and parameter accuracy and lower latency than baselines across the simple, multi-tool, and multi-turn categories.

Open release: code and dataset links provided for reproduction and adoption.

Key Findings

High task-completion on TCDD tool-calling benchmark.

Numbers: Task completion: 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)

Large improvements over a SOTA agent (ReAct) on the benchmark.

Numbers: Gains vs ReAct: +13.2% (simple), +66.1% (multi-tool), +80.3% (multi-turn)

PMP enables handling very large parameter batches.

Numbers: DrugPilot processed 91 molecules (avg len 52); ChatGPT-4o failed above 51 molecules

SFT and Fe-Fo materially improve correctness and speed.

Numbers: Tool-selection accuracy 95.0% and parameter-extraction accuracy 93.7%; latency fell from 30.91s to 15.18s
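
The latency figures in the last finding amount to roughly a halving of response time. A quick check of that arithmetic, using only the two reported numbers:

```python
# Latency values reported in the finding above, in seconds.
before_s = 30.91
after_s = 15.18

reduction = (before_s - after_s) / before_s  # fraction of time saved
speedup = before_s / after_s                 # how many times faster

print(f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
# prints: reduction: 50.9%, speedup: 2.04x
```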

Results

Task completion rate (TCDD)

Value: 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)

Baseline: ReAct

Accuracy

Value: Acc.F 95.0%, Acc.P 93.7%

Baseline: without SFT/Fe-Fo

Multi-turn latency

Value: Average < 20s for DrugPilot

Baseline: Other agents > 40s

Who Should Care

What To Try In 7 Days

Prototype a key-value memory pool for your tool inputs to avoid context bloat.

Fine-tune an LLM on a small tool-calling instruction set (LoRA) to reduce parameter errors.

Add a lightweight feedback checker that validates tool names/params and prompts corrective output when errors occur.
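
The third item can be prototyped in a few dozen lines. The sketch below is written in the spirit of such a checker and is not the Fe-Fo implementation. The tool names, parameter names, and prompt wording are all hypothetical.

```python
import json

# Hypothetical tool registry: tool name -> required parameter names.
TOOL_SCHEMAS = {
    "predict_property": {"smiles_key", "property"},
    "predict_affinity": {"smiles_key", "target_key"},
}

def check_tool_call(raw: str) -> list[str]:
    """Return a list of readable problems; an empty list means the call is valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"Output is not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["Tool call must be a JSON object."]

    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        return [f"Unknown tool {name!r}. Choose one of: {sorted(TOOL_SCHEMAS)}"]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["'arguments' must be a JSON object."]

    required = TOOL_SCHEMAS[name]
    problems = []
    missing = required - set(args)
    extra = set(args) - required
    if missing:
        problems.append(f"Missing parameters: {sorted(missing)}")
    if extra:
        problems.append(f"Unexpected parameters: {sorted(extra)}")
    return problems

def corrective_prompt(task: str, problems: list[str]) -> str:
    """Restate the task and list the errors so the model can retry."""
    lines = [f"Task: {task}", "Your previous tool call had these problems:"]
    lines += [f"- {p}" for p in problems]
    lines.append("Reply with a corrected JSON tool call only.")
    return "\n".join(lines)

bad_call = '{"name": "predict_property", "arguments": {"smiles_key": "batch_1"}}'
problems = check_tool_call(bad_call)
if problems:
    print(corrective_prompt("Predict logP for batch_1", problems))
```

Returning a list of problems rather than raising on the first one lets a single corrective prompt cover every error found in the call.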

Agent Features

Memory

  • parameterized memory pool (key-value store)
  • supports CRUD and large-batch retrieval
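
A minimal sketch of what a key-value pool with CRUD and batch retrieval might look like. This is a toy illustration rather than DrugPilot's implementation; the class name, method names, and the `describe` summary format are assumptions.

```python
from typing import Any

class MemoryPool:
    """Toy key-value pool (hypothetical API, not the DrugPilot code)."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}

    def create(self, key: str, value: Any) -> None:
        if key in self._store:
            raise KeyError(f"{key!r} already exists")
        self._store[key] = value

    def read(self, key: str) -> Any:
        return self._store[key]

    def update(self, key: str, value: Any) -> None:
        if key not in self._store:
            raise KeyError(key)
        self._store[key] = value

    def delete(self, key: str) -> None:
        del self._store[key]

    def read_many(self, keys: list[str]) -> dict[str, Any]:
        """Batch retrieval: fetch several parameters in one call."""
        return {k: self._store[k] for k in keys}

    def describe(self) -> str:
        """Summary for the LLM prompt: key names and sizes, never the values."""
        lines = []
        for key, value in self._store.items():
            size = f"length {len(value)}" if hasattr(value, "__len__") else "scalar"
            lines.append(f"{key}: {type(value).__name__}, {size}")
        return "\n".join(lines)

pool = MemoryPool()
pool.create("smiles_batch", ["CCO", "c1ccccc1", "CC(=O)O"])
pool.update("smiles_batch", pool.read("smiles_batch") + ["CCN"])
print(pool.describe())  # prints: smiles_batch: list, length 4
```

Only the output of `describe` would go into the model's prompt, so the prompt grows with the number of keys and not with the size of the stored data.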

Planning

  • autonomous multi-stage planning
  • workflow orchestration

Tool Use

  • function calling (JSON tool calls)
  • tool selection and parameter passing
  • tool invocation verification via Fe-Fo
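
To illustrate how a JSON tool call can carry a reference to stored data in place of the data itself, here is a small dispatch sketch. The `{"$key": ...}` reference convention, the tool name, and the pool contents are assumptions made for the example and are not taken from the paper.

```python
import json

# Hypothetical pool contents and tool; all names are illustrative only.
pool = {"smiles_batch": ["CCO", "c1ccccc1", "CC(=O)O"]}

def count_molecules(smiles: list[str]) -> int:
    """Stand-in for a real drug-discovery tool."""
    return len(smiles)

TOOLS = {"count_molecules": count_molecules}

def dispatch(raw_call: str, pool: dict) -> object:
    """Parse a JSON tool call, resolve pool references, and invoke the tool."""
    call = json.loads(raw_call)
    func = TOOLS[call["name"]]
    resolved = {}
    for param, value in call["arguments"].items():
        # An argument shaped like {"$key": "<pool key>"} is swapped
        # for the value stored under that key.
        if isinstance(value, dict) and "$key" in value:
            resolved[param] = pool[value["$key"]]
        else:
            resolved[param] = value
    return func(**resolved)

llm_output = '{"name": "count_molecules", "arguments": {"smiles": {"$key": "smiles_batch"}}}'
print(dispatch(llm_output, pool))  # prints: 3
```

The model only has to emit the short key name correctly, which is easier to verify than a long list of molecule strings.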

Frameworks

  • LoRA
  • Ollama deployment
  • Fe-Fo feedback-focus

Is Agentic

true

Architectures

  • LLM-based agent
  • parameterized memory pool (PMP)

Collaboration

  • human-in-the-loop parameter editing
  • integration with external models (AI model zoo)

Optimization Features

Token Efficiency

  • PMP moves large parameters out of LLM context to save tokens
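
A rough way to see the saving is to compare prompt sizes with and without inlined data. The sketch below uses character counts as a crude proxy for tokens and 91 placeholder strings of length 52, mirroring the batch size and average length quoted in the key findings. The prompt wording and key name are hypothetical.

```python
# 91 placeholder strings of length 52 (placeholders, not real study data).
molecules = ["C" * 52 for _ in range(91)]

# Prompt with every molecule pasted into the context.
inline_prompt = "Predict logP for: " + ", ".join(molecules)

# Prompt that only names the pool key holding the same batch.
keyed_prompt = "Predict logP for the molecules stored under key 'smiles_batch'."

print(f"inline: {len(inline_prompt)} chars, keyed: {len(keyed_prompt)} chars")
```

Actual token counts depend on the tokenizer, but the keyed prompt stays constant in size as the batch grows.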

Infra Optimization

  • Ollama deployment for inference-stage models

Model Optimization

  • LoRA
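
For context on why LoRA keeps fine-tuning cheap: it freezes a weight matrix and trains two small low-rank factors beside it. The sketch below counts trainable parameters for one matrix. The dimensions and rank are illustrative assumptions; the summary does not report the settings the authors used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains B (d_out x rank) and A (rank x d_in) beside the frozen weight."""
    return rank * (d_in + d_out)

# Illustrative values only: one 4096 x 4096 projection with rank 8.
d_in, d_out, rank = 4096, 4096, 8
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)

print(f"{lora:,} of {full:,} ({lora / full:.2%})")
# prints: 65,536 of 16,777,216 (0.39%)
```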

System Optimization

  • Fe-Fo reduces retries by detecting and fixing format/parameter errors

Training Optimization

  • Accuracy

Inference Optimization

  • Reduced context length via PMP leading to faster inference

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • TCDD is a synthetic instruction dataset; generalization to messy real-world dialogs is not fully proven.
  • Case studies are limited to in-silico tasks (GDSCv2, BACE); wet-lab validation is absent.
  • Some results depend on fine-tuning (SFT); without it performance drops.

When Not To Use

  • For regulated clinical decisions that need audited provenance and wet-lab validation.
  • If you cannot fine-tune models or cannot host inference (no Ollama/compute).
  • When tool outputs require strict cryptographic or compliance controls not supported by PMP.

Failure Modes

  • If PMP keys are missing or misnamed, the agent may select wrong parameters.
  • LLMs can still hallucinate function names or parameter formats if the memory prompt is ignored.
  • Performance drops sharply without SFT or Fe-Fo components.
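
One possible guard against the first failure mode is to refuse unknown keys and suggest the closest existing one. The sketch below uses the standard library's `difflib.get_close_matches`; the function name `resolve_key` and the 0.8 similarity cutoff are assumptions.

```python
from difflib import get_close_matches

def resolve_key(requested: str, pool: dict) -> str:
    """Return the key if present; otherwise raise with a near-match hint."""
    if requested in pool:
        return requested
    suggestions = get_close_matches(requested, list(pool), n=1, cutoff=0.8)
    hint = f" Did you mean {suggestions[0]!r}?" if suggestions else ""
    raise KeyError(f"No pool entry named {requested!r}.{hint}")

example_pool = {"smiles_batch": ["CCO"], "target_seq": "MKT"}

try:
    resolve_key("smile_batch", example_pool)
except KeyError as err:
    print(err.args[0])
```

Raising with a hint, and not silently substituting the nearest key, leaves the correction to the feedback loop or the user, so a wrong guess cannot feed the wrong data to a tool.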

Core Entities

Models

  • Llama3.1-8B
  • Llama3-8B
  • Mistral-NeMo
  • Gemma2
  • Qwen2
  • DeepSeek-LLM-7B
  • DeepSeek-R1
  • ChatGPT-4o

Metrics

  • Accuracy
  • task completion rate
  • latency (seconds)

Datasets

  • TCDD (tool-calling dataset, 2,800 samples)
  • GDSCv2 (case study)
  • BACE (case study)

Benchmarks

  • Berkeley function-calling leaderboard (customized evaluation on TCDD)