Survey: how LLMs learn to use external tools — workflow, benchmarks, and open problems

May 28, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

4

Authors

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen


Why It Matters For Business

Linking LLMs to real tools turns language models from static answerers into actionable assistants that fetch fresh facts, run domain tools, automate workflows, and show decision steps—improving trust and utility in products.

Summary TLDR

This paper is a focused, up-to-date survey of 'tool learning' for large language models (LLMs). It explains why connecting LLMs to external tools matters (knowledge updates, domain expertise, automation, multi-modal input, interpretability, robustness), organizes methods around a four-stage workflow (task planning, tool selection, tool calling, response generation), catalogs 33+ benchmarks and toolkits, and lists major gaps: latency, benchmark realism, tool quality, safety, and unified frameworks. The survey reviews over 150 papers and points to practical toolkits such as LangChain and Auto-GPT.

Problem Statement

LLM capabilities keep growing, but the models remain limited by fixed knowledge, weak domain-specific skills, an inability to act on the world, and brittleness to input variation. Research on letting LLMs call external tools is moving fast but is fragmented. Practitioners lack a clear, unified map of methods, benchmarks, and open gaps for building reliable tool-augmented systems.

Main Contribution

Organizes tool learning around a four-stage workflow: task planning, tool selection, tool calling, response generation.

Summarizes why tools help LLMs across six concrete benefits: knowledge, expertise, automation, interaction, interpretability, robustness.

Catalogs 33+ benchmarks and popular toolkits; highlights dataset sizes and limitations (e.g., ToolBench2: 16,464 tools).

Classifies methods as tuning-free vs tuning-based at each stage and compares strengths and trade-offs.

Identifies practical challenges and promising directions: latency, evaluation rigor, tool coverage, safety, and multi-modal tool use.
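The four-stage workflow named above can be sketched as a small pipeline. This is a minimal illustration, not the survey's implementation: the `Tool` class, the two toy tools, and the word-overlap and last-word heuristics are all assumptions standing in for LLM calls at each stage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # takes one string argument, returns a string


def plan_tasks(query: str) -> List[str]:
    # Stage 1, task planning: split the query into sub-tasks.
    # A real system would ask an LLM; splitting on " and " is a stand-in.
    return [part.strip() for part in query.split(" and ") if part.strip()]


def select_tool(subtask: str, tools: Dict[str, Tool]) -> Tool:
    # Stage 2, tool selection: pick the tool whose description shares
    # the most words with the sub-task (stand-in for a retriever or LLM).
    words = set(subtask.lower().split())
    return max(tools.values(),
               key=lambda t: len(words & set(t.description.lower().split())))


def call_tool(tool: Tool, subtask: str) -> str:
    # Stage 3, tool calling: extract the argument and invoke the tool.
    # Here the argument is simply the last word of the sub-task.
    return tool.run(subtask.split()[-1])


def generate_response(query: str, observations: List[str]) -> str:
    # Stage 4, response generation: merge tool outputs into one answer.
    return f"{query} -> " + "; ".join(observations)


def answer(query: str, tools: Dict[str, Tool]) -> str:
    observations = []
    for subtask in plan_tasks(query):
        tool = select_tool(subtask, tools)
        observations.append(call_tool(tool, subtask))
    return generate_response(query, observations)


# Hypothetical tools used only for this illustration.
TOOLS = {
    "weather": Tool("weather", "get the weather forecast for a city",
                    lambda city: f"weather({city})=sunny"),
    "stock": Tool("stock", "get the stock price for a ticker",
                  lambda ticker: f"stock({ticker})=100"),
}

if __name__ == "__main__":
    print(answer("get weather for Paris and get stock price for ACME", TOOLS))
```

Keeping each stage as its own function makes it easy to swap a heuristic for a tuning-free prompt or a tuned model at one stage without touching the others.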

Key Findings

The survey reviewed more than 150 papers on tool learning.

Numbers: 150+ papers reviewed

ToolBench2 is currently the largest public tool-learning dataset.

Numbers: 16,464 tools; 126,486 instances

Many existing benchmarks use LLM-generated queries, which can misrepresent real user needs.

Two dominant implementation paradigms exist: one-step planning and iterative (feedback-driven) planning.

Tool retrieval is commonly split into sparse (term-based) and dense (semantic) methods; retriever+LLM is a common practical design.
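To make the sparse versus dense distinction concrete, the sketch below scores the same tool set two ways: a small Okapi BM25 scorer over whitespace tokens (sparse) and cosine similarity over vectors (dense). The tool descriptions are invented for illustration, and the three-dimensional vectors are hand-written placeholders, not outputs of any real embedding model.

```python
import math
from collections import Counter

# Hypothetical tool descriptions.
TOOL_DOCS = {
    "get_weather": "return the weather forecast for a city",
    "get_stock_price": "return the latest stock price for a ticker symbol",
    "convert_currency": "convert an amount from one currency to another",
}

# Placeholder "embeddings"; a real system would compute these with a model.
TOY_TOOL_VECS = {
    "get_weather": [1.0, 0.0, 0.0],
    "get_stock_price": [0.0, 1.0, 0.2],
    "convert_currency": [0.0, 0.3, 1.0],
}


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Sparse retrieval: Okapi BM25 over whitespace tokens."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_scores(query_vec, tool_vecs):
    """Dense retrieval: cosine similarity between embedding vectors."""
    return {name: cosine(query_vec, vec) for name, vec in tool_vecs.items()}


if __name__ == "__main__":
    sparse = bm25_scores("stock price for ACME", TOOL_DOCS)
    dense = dense_scores([0.1, 0.9, 0.1], TOY_TOOL_VECS)
    print(max(sparse, key=sparse.get), max(dense, key=dense.get))
```

The sparse scorer only rewards exact token matches, while the dense scorer can rank a tool highly even when the query shares no words with its description, provided the embeddings are good.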

Who Should Care

What To Try In 7 Days

Run a two-stage tool pipeline: add a fast retriever (BM25/embeddings) to pick 5 candidate APIs and let your LLM choose among them.

Use an existing toolkit (LangChain or BMTools) to prototype a single use case (e.g., pricing lookup) and measure latency and failure modes.

Evaluate with both synthetic queries and a small set of about 100 real user queries to spot benchmark gaps and mismatches.
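A rough harness for the experiments above might look like the following. It is a sketch under stated assumptions: the retriever is plain word overlap standing in for BM25 or embeddings, `choose` is a hypothetical wrapper you would write around your own LLM, and the tool names and descriptions are made up.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical tool catalog.
TOOL_DOCS = {
    "pricing_lookup": "look up the price for a product sku",
    "inventory_check": "check inventory level of a product",
    "weather": "get the weather forecast for a city",
    "currency": "convert an amount between currencies",
    "calendar": "create a calendar event",
    "email": "send an email message",
    "translate": "translate text to another language",
}


def retrieve_candidates(query: str, tool_docs: Dict[str, str], k: int = 5) -> List[str]:
    """Stage 1: rank tools by word overlap with the query and keep the top k."""
    query_words = set(query.lower().split())
    ranked = sorted(
        tool_docs,
        key=lambda name: len(query_words & set(tool_docs[name].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def run_once(query: str, tool_docs: Dict[str, str],
             choose: Callable[[str, List[str]], str], k: int = 5):
    """Stage 2: let `choose` (your LLM wrapper) pick among the candidates."""
    start = time.perf_counter()
    candidates = retrieve_candidates(query, tool_docs, k)
    choice = choose(query, candidates)
    return choice, candidates, time.perf_counter() - start


def evaluate(labelled_queries: List[Tuple[str, str]], tool_docs, choose, k: int = 5):
    """Report selection accuracy, retriever misses, and mean latency."""
    correct = retriever_misses = 0
    total_latency = 0.0
    for query, expected_tool in labelled_queries:
        choice, candidates, latency = run_once(query, tool_docs, choose, k)
        correct += choice == expected_tool
        retriever_misses += expected_tool not in candidates
        total_latency += latency
    n = len(labelled_queries)
    return {
        "accuracy": correct / n,
        "retriever_miss_rate": retriever_misses / n,
        "mean_latency_s": total_latency / n,
    }


def pick_first(query: str, candidates: List[str]) -> str:
    # Stand-in for an LLM call: always trust the retriever's top result.
    return candidates[0]


if __name__ == "__main__":
    queries = [("price for product sku", "pricing_lookup"),
               ("send an email message to bob", "email")]
    print(evaluate(queries, TOOL_DOCS, pick_first))
```

Tracking retriever misses separately from final accuracy shows whether failures come from the candidate stage or from the LLM's choice, and running the same harness on synthetic and real queries exposes the gap between them.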

Agent Features

Memory

  • Retrieval memory (RAG-style external knowledge)

Planning

  • One-step (plan once) planning
  • Iterative (feedback-driven) planning
  • Decision-tree / search-based planning
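The difference between the first two planning styles can be shown in a few lines. In this sketch the planner and policy are hand-written stubs where a real system would call an LLM, and `lookup_price` is a hypothetical tool; the point is only that the iterative loop sees each observation before choosing the next step.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]            # (tool name, argument)
History = List[Tuple[Step, str]]  # executed steps with their observations


def execute(step: Step, tools: Dict[str, Callable[[str], str]]) -> str:
    name, arg = step
    try:
        return tools[name](arg)
    except Exception as exc:  # surface failures as observations
        return f"ERROR: {exc}"


def one_step_planning(task: str, make_plan, tools) -> History:
    # Plan once up front, then execute every step without looking at results.
    return [(step, execute(step, tools)) for step in make_plan(task)]


def iterative_planning(task: str, next_step, tools, max_steps: int = 5) -> History:
    # Choose one step at a time; each decision sees all earlier observations.
    history: History = []
    for _ in range(max_steps):
        step = next_step(task, history)
        if step is None:  # the policy decides the task is finished
            break
        history.append((step, execute(step, tools)))
    return history


def lookup_price(ticker: str) -> str:
    # Hypothetical tool that only knows one ticker.
    prices = {"ACME": "100"}
    if ticker not in prices:
        raise ValueError(f"unknown ticker {ticker}")
    return prices[ticker]


TOOLS = {"lookup_price": lookup_price}


def fixed_plan(task: str) -> List[Step]:
    # Stub planner: commits to one (wrong) argument and cannot revise it.
    return [("lookup_price", "Acme Corp")]


def feedback_policy(task: str, history: History) -> Optional[Step]:
    # Stub policy: retries with a corrected argument after an error.
    if not history:
        return ("lookup_price", "Acme Corp")
    _, last_observation = history[-1]
    if last_observation.startswith("ERROR"):
        return ("lookup_price", "ACME")
    return None


if __name__ == "__main__":
    print(one_step_planning("price of Acme", fixed_plan, TOOLS))
    print(iterative_planning("price of Acme", feedback_policy, TOOLS))
```

One-step planning costs fewer model calls, while the iterative loop can recover from a failed call at the price of extra latency per step.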

Tool Use

  • API function calls
  • Plugin-based tool invocation
  • Chained serial tool calls

Frameworks

  • LangChain
  • Auto-GPT
  • BabyAGI
  • BMTools

Is Agentic

true

Architectures

  • Retriever + LLM pipeline
  • Tool graph / directed-graph navigation
  • Multi-agent collaborative frameworks

Collaboration

  • Multi-agent cooperation (specialized execution agents)

Optimization Features

Token Efficiency

  • Context compression and output summarization (ReCOMP, Xu et al.)

System Optimization

  • Two-stage retrieval + LLM selection to manage large tool sets

Training Optimization

  • Finetuning with API-call datasets (Toolformer, ToolLLaMA)
  • RLHF and execution-feedback tuning

Inference Optimization

  • Retriever narrowing to reduce context and latency
  • Intent-driven gating for tool selection

Reproducibility

Data Urls

  • Benchmarks referenced have public links (see Table 1; e.g., ToolBench2, API-Bank)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey relies on many recent papers; the field evolves rapidly and new benchmarks may appear after publication.
  • Benchmarks cataloged often contain synthetic, LLM-generated queries that may not match live user behavior.
  • Tool datasets include many inaccessible or nonfunctional APIs, limiting direct reproducibility.
  • Comparative quantitative evaluation of methods across stages is limited or missing.

When Not To Use

  • When the application is safety-critical and you cannot independently validate tool outputs.
  • When latency must be sub-second and external API calls introduce delay.
  • When you need fine-tuned tool behavior but cannot access open-source LLMs for tuning.

Failure Modes

  • Hallucinations driven by erroneous or malicious tool outputs.
  • Tool-call format errors and parameter mis-parsing causing failed calls.
  • Poor tool retrieval leading to missing or incomplete tool sets.
  • High latency or tool unavailability degrading user experience.
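One way to catch the format and parameter errors listed above is to validate the model's output before anything is executed. The sketch below assumes the model emits a JSON object with `name` and `arguments` keys; that shape, the `get_weather` tool, and its parameters are illustrative assumptions, not a format prescribed by the survey.

```python
import json

# Hypothetical tool registry: parameter name -> expected Python type.
TOOL_SCHEMAS = {
    "get_weather": {"city": str, "days": int},
}


def parse_tool_call(raw, schemas=TOOL_SCHEMAS):
    """Return (call, None) on success or (None, error message) on failure."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"

    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in schemas:
        return None, f"unknown tool: {name!r}"
    if not isinstance(args, dict):
        return None, "arguments must be an object"

    schema = schemas[name]
    missing = sorted(set(schema) - set(args))
    if missing:
        return None, f"missing parameters: {missing}"
    extra = sorted(set(args) - set(schema))
    if extra:
        return None, f"unexpected parameters: {extra}"

    for key, expected in schema.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            return None, f"parameter {key!r} should be {expected.__name__}"
    return call, None


if __name__ == "__main__":
    good = '{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}'
    bad = '{"name": "get_weather", "arguments": {"city": "Paris", "days": "3"}}'
    print(parse_tool_call(good))
    print(parse_tool_call(bad))
```

Returning the error as a string rather than raising makes it straightforward to feed the message back to the model for a retry.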

Core Entities

Models

  • GPT-4
  • GPT-J
  • LLaMA
  • LLaMA-7B
  • ToolLLaMA
  • Toolformer

Metrics

  • Recall@K
  • NDCG@K
  • COMP@K
  • BLEU
  • ROUGE-L
  • Exact Match
  • F1
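For the two retrieval metrics at the top of this list, a minimal sketch with binary relevance (a tool is either in the ground-truth set or not) is given below. The function names are illustrative, and individual benchmarks may define or normalize these metrics slightly differently.

```python
import math
from typing import Iterable, List


def recall_at_k(ranked: List[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of ground-truth tools that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Iterable[str], k: int) -> float:
    """Binary-relevance NDCG: rewards placing relevant tools near the top."""
    relevant = set(relevant)
    # Discounted gain: a hit at 0-based rank i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, tool in enumerate(ranked[:k]) if tool in relevant)
    # Ideal ordering puts all relevant tools first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    ranked = ["search", "calendar", "weather", "email"]
    relevant = {"search", "weather"}
    print(recall_at_k(ranked, relevant, k=1))
    print(recall_at_k(ranked, relevant, k=3))
    print(round(ndcg_at_k(ranked, relevant, k=3), 4))
```

Recall@K ignores ordering inside the top K, while NDCG@K penalizes relevant tools that appear lower in the list, so reporting both separates coverage from ranking quality.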

Datasets

  • ToolBench2
  • API-Bank
  • APIBench
  • ToolBench1
  • ToolAlpaca
  • RestBench
  • ToolEyes
  • Seal-Tools

Benchmarks

  • API-Bank
  • APIBench
  • ToolBench1
  • ToolBench2
  • ToolEyes
  • MetaTool
  • RestBench
  • T-Eval
  • UltraTool

Context Entities

Models

  • Gorilla
  • ToolNet
  • ProTIP
  • CRAFT

Metrics

  • Pass rate (ChatGPT)
  • Human evaluation
  • ToolEval automated scoring

Datasets

  • ToolQA
  • TaskBench
  • API-BLEND
  • StableToolBench
  • SciToolBench

Benchmarks

  • RoTBench
  • ToolSword
  • MLLM-Tool
  • SoAyBench
  • ToolLens