Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
DrugPilot reduces manual tool switching and context-overflow failures by structuring tool inputs as key-value parameters, which improves accuracy and shortens runtime for automated multi-step drug discovery workflows.
Summary TLDR
DrugPilot is an LLM-based agent designed for end-to-end drug discovery workflows. It introduces a parameterized memory pool (PMP) that stores large multimodal drug data as key-value pairs, and a feedback-focus (Fe-Fo) mechanism that checks and corrects tool calls. The authors release TCDD, a 2,800-sample tool-calling dataset, fine-tune LLMs, and show major gains on a function-calling benchmark: task completion rates of 98.0% (simple), 93.5% (multi-tool), and 64.0% (multi-turn). PMP removes large parameters from the model context, enabling large-batch processing and faster, more accurate multi-step tool use. Code and data links are provided.
Problem Statement
LLM agents struggle when drug discovery tasks need large multimodal inputs, precise tool calls, and multi-turn workflows. Text-only memory overloads the model context, tool selection and parameter passing become error-prone, and users without coding skills cannot chain domain tools reliably.
Main Contribution
Parameterized Memory Pool (PMP): a key-value store that removes large parameters from LLM context and supplies structured inputs to tools.
Feedback-Focus (Fe-Fo): an error-detection and feedback loop that restates tasks and provides corrective prompts for tool-calling mistakes.
TCDD dataset: 2,800 annotated instruction samples covering 8 core drug discovery tools for fine-tuning and evaluation.
Benchmarks and results: DrugPilot achieves higher function and parameter accuracy and lower latency than baselines across the simple, multi-function, and multi-turn categories.
Open release: code and dataset links provided for reproduction and adoption.
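To make the PMP idea concrete, here is a minimal sketch of a key-value pool that keeps large values out of the prompt and exposes only key names and sizes to the model. The class name, method names, and example keys are assumptions for illustration; they are not taken from the DrugPilot code.

```python
from typing import Any, Dict, Iterable, List


class ParameterizedMemoryPool:
    """Hypothetical key-value store for large tool parameters."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def create(self, key: str, value: Any) -> None:
        if key in self._store:
            raise KeyError(f"key already exists: {key}")
        self._store[key] = value

    def read(self, key: str) -> Any:
        return self._store[key]

    def update(self, key: str, value: Any) -> None:
        if key not in self._store:
            raise KeyError(f"unknown key: {key}")
        self._store[key] = value

    def delete(self, key: str) -> None:
        del self._store[key]

    def read_batch(self, keys: Iterable[str]) -> List[Any]:
        """Fetch several values at once for a tool that needs many inputs."""
        return [self._store[k] for k in keys]

    def describe(self) -> str:
        """Summary for the prompt: key names, types, and sizes, never the values."""
        lines = []
        for key, value in self._store.items():
            size = len(value) if hasattr(value, "__len__") else 1
            lines.append(f"{key}: {type(value).__name__} (size {size})")
        return "\n".join(lines)


pool = ParameterizedMemoryPool()
pool.create("smiles_batch", ["CCO", "c1ccccc1", "CC(=O)O"])  # example SMILES strings
pool.create("cell_line", "A549")
summary = pool.describe()  # this short text, not the data, would go into the prompt
```

The design choice is that the model only ever sees `summary`, so prompt length stays roughly constant no matter how large the stored batch grows.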
Key Findings
High task-completion on TCDD tool-calling benchmark.
Large improvements over a SOTA agent (ReAct) on the benchmark.
PMP enables handling very large parameter batches.
SFT and Fe-Fo materially improve correctness and speed.
Results
Task completion rate (TCDD): 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)
Accuracy
Multi-turn latency
Who Should Care
What To Try In 7 Days
Prototype a key-value memory pool for your tool inputs to avoid context bloat.
Fine-tune an LLM on a small tool-calling instruction set (LoRA) to reduce parameter errors.
Add a lightweight feedback checker that validates tool names/params and prompts corrective output when errors occur.
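The feedback checker in the last step could start as small as the sketch below: it validates the tool name and parameter names of a JSON tool call and, on failure, builds a corrective prompt that restates the task. The tool registry, the `name`/`parameters` call format, and the prompt wording are all assumptions for illustration, not the Fe-Fo implementation from the paper.

```python
import json
from typing import Dict, List, Optional, Set

# Hypothetical tool registry: tool name -> required parameter names.
TOOL_SPECS: Dict[str, Set[str]] = {
    "predict_property": {"smiles_key"},
    "predict_drug_response": {"smiles_key", "cell_line_key"},
}


def check_tool_call(raw: str, task: str) -> Optional[str]:
    """Return None if the call is valid, else a corrective prompt."""
    problems: List[str] = []
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        call = None
        problems.append(f"output is not valid JSON ({exc.msg})")

    if call is not None and not isinstance(call, dict):
        problems.append("output must be a JSON object")
    elif call is not None:
        name = call.get("name")
        params = call.get("parameters", {})
        if name not in TOOL_SPECS:
            problems.append(f"unknown tool {name!r}; choose from {sorted(TOOL_SPECS)}")
        elif not isinstance(params, dict):
            problems.append("'parameters' must be a JSON object")
        else:
            missing = TOOL_SPECS[name] - set(params)
            extra = set(params) - TOOL_SPECS[name]
            if missing:
                problems.append(f"missing parameters: {sorted(missing)}")
            if extra:
                problems.append(f"unexpected parameters: {sorted(extra)}")

    if not problems:
        return None
    bullet_list = "\n".join(f"- {p}" for p in problems)
    return (
        f"Task: {task}\n"
        f"Your previous tool call had errors:\n{bullet_list}\n"
        "Return a corrected JSON tool call."
    )
```

A caller would send the returned prompt back to the model and retry, stopping after a fixed number of attempts.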
Agent Features
Memory
- parameterized memory pool (key-value store)
- supports CRUD and large-batch retrieval
Planning
- autonomous multi-stage planning
- workflow orchestration
Tool Use
- function calling (JSON tool calls)
- tool selection and parameter passing
- tool invocation verification via Fe-Fo
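One way to picture JSON tool calls combined with key-based parameter passing: the model emits only a tool name and memory keys, and a dispatcher swaps each key for the stored value just before invocation. This is a sketch under assumptions; the call format, the `dispatch` helper, and the toy `count_letters` tool are hypothetical and stand in for real drug discovery tools.

```python
import json
from typing import Any, Callable, Dict, List

# Plain dict standing in for the memory pool; values never enter the prompt.
memory: Dict[str, Any] = {"smiles_batch": ["CCO", "CC(=O)O"]}


def count_letters(smiles: List[str]) -> List[int]:
    """Toy stand-in for a real tool: counts alphabetic characters per SMILES."""
    return [sum(ch.isalpha() for ch in s) for s in smiles]


TOOLS: Dict[str, Callable[..., Any]] = {"count_letters": count_letters}


def dispatch(raw_call: str) -> Any:
    """Parse a JSON tool call, resolve memory keys to values, and run the tool."""
    call = json.loads(raw_call)
    func = TOOLS[call["name"]]
    # Each parameter maps an argument name to a memory key, not to raw data.
    kwargs = {arg: memory[key] for arg, key in call["parameters"].items()}
    return func(**kwargs)


result = dispatch('{"name": "count_letters", "parameters": {"smiles": "smiles_batch"}}')
```

Because the call carries keys rather than data, its size is the same for a batch of two molecules or two million.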
Frameworks
- LoRA
- Ollama deployment
- Fe-Fo feedback-focus
Is Agentic
true
Architectures
- LLM-based agent
- parameterized memory pool (PMP)
Collaboration
- human-in-the-loop parameter editing
- integration with external models (AI model zoo)
Optimization Features
Token Efficiency
- PMP moves large parameters out of LLM context to save tokens
Infra Optimization
- Ollama deployment for inference-stage models
Model Optimization
- LoRA
System Optimization
- Fe-Fo reduces retries by detecting and fixing format/parameter errors
Training Optimization
- Accuracy
Inference Optimization
- Reduced context length via PMP leading to faster inference
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- TCDD is a synthetic instruction dataset; generalization to messy real-world dialogs is not fully proven.
- Case studies are limited to in-silico tasks (GDSCv2, BACE); wet-lab validation is absent.
- Some results depend on fine-tuning (SFT); without it performance drops.
When Not To Use
- For regulated clinical decisions that need audited provenance and wet-lab validation.
- If you cannot fine-tune models or cannot host inference (no Ollama/compute).
- When tool outputs require strict cryptographic or compliance controls not supported by PMP.
Failure Modes
- If PMP keys are missing or misnamed, the agent may select wrong parameters.
- LLMs can still hallucinate function names or parameter formats if the memory prompt is ignored.
- Performance drops sharply without SFT or Fe-Fo components.
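The first failure mode (missing or misnamed keys) can be guarded against before any tool runs. The sketch below rejects unknown keys and suggests a near match using Python's standard `difflib`; the function name, the 0.8 similarity cutoff, and the message wording are assumptions for illustration.

```python
import difflib
from typing import Iterable, List


def validate_keys(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return one error message per requested key that is not in the pool."""
    known = list(available)
    errors: List[str] = []
    for key in requested:
        if key in known:
            continue
        # Suggest the closest existing key, if any is similar enough.
        close = difflib.get_close_matches(key, known, n=1, cutoff=0.8)
        hint = f"; did you mean {close[0]!r}?" if close else ""
        errors.append(f"unknown memory key {key!r}{hint}")
    return errors


errors = validate_keys(["smile_batch", "cell_line"], ["smiles_batch", "cell_line"])
```

Returning a message rather than silently substituting the suggested key keeps the correction visible, so it can be passed back to the model or to a human reviewer.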
Core Entities
Models
- Llama3.1-8B
- Llama3-8B
- Mistral-NeMo
- Gemma2
- Qwen2
- DeepSeek-LLM-7B
- DeepSeek-R1
- ChatGPT-4o
Metrics
- Accuracy
- task completion rate
- latency (seconds)
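For readers who want to track the same kinds of metrics on their own tool-calling logs, a minimal sketch follows. The definitions here are assumptions (a task counts as completed only when both the function and its parameters are correct); the paper's exact metric definitions may differ, and the records are made-up illustration data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    function_correct: bool  # the right tool was selected
    params_correct: bool    # all parameters matched the reference
    latency_s: float        # wall-clock time for the call, in seconds


def summarize(records: List[EvalRecord]) -> Dict[str, float]:
    """Aggregate per-call records into accuracy, completion, and latency."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    return {
        "function_accuracy": sum(r.function_correct for r in records) / n,
        "parameter_accuracy": sum(r.params_correct for r in records) / n,
        # Assumed definition: completed means function AND parameters correct.
        "task_completion_rate": sum(
            r.function_correct and r.params_correct for r in records
        ) / n,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
    }


# Made-up records for illustration only.
records = [
    EvalRecord(True, True, 1.0),
    EvalRecord(True, False, 2.0),
    EvalRecord(False, False, 3.0),
    EvalRecord(True, True, 2.0),
]
stats = summarize(records)
```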
Datasets
- TCDD (tool-calling dataset, 2,800 samples)
- GDSCv2 (case study)
- BACE (case study)
Benchmarks
- Berkeley function-calling leaderboard (customized evaluation on TCDD)

