Tool Planning Papers — Parsed & Scored for Practitioners

Let LLMs translate problems and a classical planner find correct, often optimal, plans

0.70

0.60

0.70

84

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Key finding

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)
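
The division of labor described above can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: `translate_to_problem` is a hypothetical stand-in for the LLM step and simply returns a hard-coded Blocksworld problem, and a plain breadth-first search stands in for a real PDDL planner. Because breadth-first search explores shorter plans first, the plan it returns is a shortest one for this toy domain.

```python
from collections import deque


def translate_to_problem(request: str) -> dict:
    """Hypothetical stand-in for the LLM step (hard-coded, not from the paper).

    Each stack is a tuple of blocks listed bottom to top.
    """
    return {
        "init": (("A", "B"), ("C",)),   # B sits on A; C is alone on the table
        "goal": (("C", "B", "A"),),     # A on B on C
    }


def normalize(stacks):
    """Drop empty stacks and sort, so equivalent states compare equal."""
    return tuple(sorted(s for s in stacks if s))


def successors(state):
    """Yield (action, next_state) for every legal single-block move."""
    for i, src in enumerate(state):
        block, remaining = src[-1], src[:-1]
        rest = [s for j, s in enumerate(state) if j != i]
        if remaining:  # moving a lone block to the table changes nothing
            yield f"move {block} to table", normalize(rest + [remaining, (block,)])
        for k, dst in enumerate(rest):
            moved = rest[:k] + [dst + (block,)] + rest[k + 1:] + [remaining]
            yield f"move {block} onto {dst[-1]}", normalize(moved)


def plan(init, goal):
    """Breadth-first search: returns a shortest list of actions, or None."""
    start, target = normalize(init), normalize(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if state == target:
            return steps
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None


problem = translate_to_problem("Stack A on B on C")
print(plan(problem["init"], problem["goal"]))
```

The point of the split is that correctness lives in the search routine, which can be checked independently of whatever produced the problem description.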

ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

0.70

63

If you build assistants that call external services, training on many real APIs plus a retriever and multi-path planning dramatically reduces manual engineering and makes open-source models practically competitive with closed systems.

Key finding

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers: 16,464 APIs; 126,486 instances; 469,585 real API calls

ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

0.30

0.40

0.20

39

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Key finding

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

OptiGuide: use LLMs to translate plain-English what‑if questions into solver code and human explanations without sending private data

0.70

0.50

0.70

32

OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.

Key finding

GPT‑4 achieves high accuracy answering quantitative supply‑chain questions when given examples in the prompt.

Numbers: ≈93% average accuracy (GPT‑4, in‑distribution)

API-Bank: a large, runnable benchmark and training set to measure and improve LLMs' API/tool use; includes Lynx, a fine-tuned model

0.50

0.70

0.80

12

API-Bank lets product and engineering teams measure how reliably models call real APIs, cheaply create high-quality tool-use training data, and reduce costly manual labeling. Improving API reliability cuts user-facing errors like failed requests or wrong side effects.

Key finding

API-Bank provides an executable evaluation set: 73 APIs, 314 manually reviewed dialogues, 753 API calls.

Numbers: 73 APIs; 314 dialogues; 753 API calls

PHIA: an agent that uses code + web search to turn wearable time-series into personalized health insights

0.50

0.65

0.40

7

Agentic LLMs that run verified code and fetch trusted web facts can unlock personalized insights from wearable data—improving product value for health apps while reducing numeric errors and buggy analyses.

Key finding

PHIA answers objective numeric wearable queries with high accuracy.

Numbers: 84% exact-match accuracy on 4,000 objective queries

SheetCopilot: turn natural language into step-by-step spreadsheet actions using LLMs

0.60

0.65

0.45

6

SheetCopilot lets non-technical users automate many spreadsheet tasks by describing them in plain English, lowering manual work and reducing mistakes. It still needs human verification for critical data, because full correctness is only about 44% on the evaluated tasks.

Key finding

High execution but moderate full correctness for GPT-3.5-Turbo with SheetCopilot.

Numbers: Exec@1 = 87.3%, Pass@1 = 44.3% (full 221 tasks)

Combine MCTS + AI self-critique + offline DPO to train web agents that learn from search traces

0.60

0.70

5

Train web agents from their own search traces to get large, fast gains in task success without risky online RL; pairing a trained policy with online search gives near‑perfect results for structured web tasks.

Key finding

Agent Q plus test-time MCTS reaches human-level or better performance on WebShop.

Numbers: WebShop success: base 28.6% → Agent Q+MCTS 50.5% (human 50.0%)

Train LLMs to plan with abstract placeholders, then fill them with tools to reason faster and more accurately

0.70

0.60

5

CoA makes multi-step tool use both more accurate and faster by separating plan generation from tool calls; this reduces arithmetic bugs and shortens latency when pipelines must call external APIs.

Key finding

CoA improves QA accuracy on evaluated math benchmarks.

Numbers: GSM8K: roughly +2.9 to +6.8 pp absolute (varies by model); average of ~7.5% reported
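
To make the plan-then-fill idea concrete, here is a minimal sketch under stated assumptions. The bracketed placeholder syntax, the `fill_chain` helper, and the hard-coded chain (standing in for model output) are illustrative choices, not the paper's implementation. The chain is written once with abstract variables, and a small calculator tool fills them in afterwards, so the arithmetic is never done by the language model.

```python
import operator
import re

# Calculator "tool": the only component that does arithmetic.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

# Matches steps such as "[20 + 35 = y1]" or "[y1 * 2 = y2]" (assumed syntax).
STEP = re.compile(r"\[(\S+) ([-+*/]) (\S+) = (\w+)\]")


def fill_chain(chain: str) -> str:
    """Replace each abstract step with its computed value, left to right."""
    values = {}

    def resolve(token: str) -> float:
        # A token is either an earlier placeholder or a numeric literal.
        return values[token] if token in values else float(token)

    def run_step(match: re.Match) -> str:
        left, op, right, name = match.groups()
        values[name] = OPS[op](resolve(left), resolve(right))
        return f"{values[name]:g}"

    return STEP.sub(run_step, chain)


# Hard-coded abstract chain standing in for what a trained model would emit.
abstract_chain = "Alice has [20 + 35 = y1] apples and doubles them to [y1 * 2 = y2]."
print(fill_chain(abstract_chain))
```

Because substitution runs left to right, a later step can refer to any placeholder defined earlier in the same chain.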

SWIFT: an open-source, one-stop framework to fine-tune, evaluate, quantize and deploy over 550 LLMs and 200+ MLLMs

0.70

0.40

0.70

5

SWIFT unifies fine-tuning, RL-style alignment, quantization and deployment for text and multimodal models. That reduces engineering overhead, accelerates experiments on agents and lets teams run many models and tuners without building custom glue code.

Key finding

SWIFT already supports a broad range of models and datasets.

Numbers: 550+ LLMs, 200+ MLLMs, ~150+ datasets (paper claims)

AutoFlow: automatically generate readable natural‑language workflows so LLM agents solve complex tasks with less human work

0.50

0.60

4

AutoFlow reduces manual workflow design time by automatically producing readable, executable agent workflows and can raise task performance on image/text benchmarks, lowering operational cost for multi‑step agent tasks.

Key finding

AutoFlow improves average OpenAGI score compared to manual CoRE when Mixtral is the interpreter and GPT‑4 is the generator.

Numbers: avg 0.3597 vs 0.2483 (Δ +0.1114, +44.9%)
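
As a quick check on how the two deltas above relate, the short calculation below derives both from the reported averages. The variable names are illustrative; only the two averages come from the entry.

```python
autoflow_avg = 0.3597   # reported average OpenAGI score for AutoFlow
core_avg = 0.2483       # reported average OpenAGI score for manual CoRE

delta = autoflow_avg - core_avg             # absolute difference in score
relative_gain_pct = 100 * delta / core_avg  # gain relative to the baseline

print(f"absolute: {delta:+.4f}, relative: {relative_gain_pct:+.1f}%")
```

The absolute gap is small in score units, so the relative figure is the one that conveys the size of the improvement over the manual baseline.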

Survey: how LLMs learn to use external tools — workflow, benchmarks, and open problems

0.60

0.50

0.60

4

Linking LLMs to real tools turns language models from static answerers into actionable assistants that fetch fresh facts, run domain tools, automate workflows, and show decision steps—improving trust and utility in products.

Key finding

The survey reviewed more than 150 papers on tool learning.

Numbers: 150+ papers reviewed

TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

0.60

0.70

0.60

3

Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.

Key finding

TOOLMAKER implemented 80% of benchmark tools (12 of 15).

Numbers: 12/15 tools correct

T-Eval: a stepwise benchmark that breaks LLM tool use into six measurable abilities

0.70

0.45

0.60

3

T-Eval gives runnable, per-skill diagnostics for building LLM-based tool agents so teams can pinpoint whether problems come from planning, choosing tools, formatting requests, or checking results.

Key finding

Top commercial models lead overall tool-use performance.

Numbers: GPT-4 overall 86.4; GPT-3.5 84.0; Claude2 78.8

Sum2Act: a router + state-manager pipeline that makes LLMs call many real APIs reliably

0.60

0.50

3

If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router + summarizing state manager can boost success and reduce repeated failures with little engineering overhead.

Key finding

Sum2Act raises average Pass Rate to 70.0% on ToolBench using ChatGPT.

Numbers: Pass Rate avg: Sum2Act 70.0% vs DFSDT 67.0% vs ReAct 41.1%

EHRAgent — an LLM that writes, runs, and debugs Python to answer complex EHR table queries with four-shot prompts

0.60

0.50

2

EHRAgent reduces dependence on data engineers by letting clinicians ask EHR questions in plain language and get accurate answers. This can speed up workflows, but it increases runtime and API calls and needs privacy safeguards.

Key finding

EHRAgent substantially improves EHR multi-table QA success rates versus prior LLM agent baselines.

Numbers: Up to +29.6 percentage points success rate (TREQS) vs strongest baseline

Agent-SafetyBench: 2,000 agent tests across 349 environments — no tested agent exceeds 60% safety

0.40

0.55

0.60

2

Tool-using LLM agents can take costly or unsafe actions; organizations should not deploy them without runtime checks, human-in-the-loop gating, and improved evaluation tailored to agent behavior.

Key finding

No tested LLM agent exceeds 60% safety on Agent-SafetyBench.

Numbers: Best model 59.8% (Claude-3-Opus); all <60%

ToolTalk: a small automated benchmark for measuring multi-step tool use in dialogs

0.50

0.60

0.50

2

If you plan to automate user tasks with LLMs, expect frequent multi-step failures and risky incorrect side effects; instrument tool calls and add verification before irreversible actions.

Key finding

Multi-step tool use is still hard: GPT-4 achieves only 50% success on hard conversations.

Numbers: GPT-4 success rate 50% (hard)
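
The advice above (instrument tool calls, verify before irreversible actions) could look like the following minimal sketch. `ToolGate`, its `approve` hook, and the two toy tools are hypothetical names chosen for illustration; they are not part of ToolTalk. Every call is logged, and any tool flagged as irreversible runs only if the verification hook approves it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolGate:
    """Hypothetical wrapper: logs every tool call and gates irreversible ones."""

    approve: Callable[[str, dict], bool]           # verification hook, e.g. a human prompt
    irreversible: set = field(default_factory=set)  # names of tools with side effects
    log: list = field(default_factory=list)         # (tool name, arguments, status)

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        if name in self.irreversible and not self.approve(name, kwargs):
            self.log.append((name, kwargs, "blocked"))
            return None
        result = fn(**kwargs)
        self.log.append((name, kwargs, "ok"))
        return result


# Toy tools standing in for real APIs.
def search_email(query: str) -> list:
    return [f"result for {query}"]


def delete_email(message_id: str) -> str:
    return f"deleted {message_id}"


# This approval hook rejects everything; a real one might ask the user.
gate = ToolGate(approve=lambda name, args: False, irreversible={"delete_email"})
found = gate.call("search_email", search_email, query="invoice")
deleted = gate.call("delete_email", delete_email, message_id="42")
print(found, deleted, gate.log)
```

Keeping the log on the gate rather than inside each tool means blocked attempts are recorded too, which is useful when auditing multi-step failures.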

ChatCRS: add a knowledge retriever and a goal planner to make LLMs useful conversational recommenders

0.60

0.70

0.40

2

If you want LLMs to make real product recommendations in a specific domain, wrap them with a KB retriever and a goal planner; that combination turns an LLM from a brittle zero-shot text generator into a materially better recommender on evaluated datasets.

Key finding

External knowledge massively improves recommendation ranking for LLMs on DuRecDial.

Numbers: ChatGPT NDCG@10: DG 0.024 -> Oracle 0.617 (DuRecDial, Table 1)
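
For readers unfamiliar with the metric, the sketch below computes NDCG@k in its standard form: discounted cumulative gain of the returned ranking divided by that of an ideal ranking. Binary relevance labels and the function names are assumptions for illustration; the paper's exact evaluation setup is not reproduced here.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (ranks start at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# relevances[i] is 1 if the item at rank i+1 is the target item, else 0.
print(ndcg_at_k([1, 0, 0]))  # target ranked first
print(ndcg_at_k([0, 0, 1]))  # target ranked third
```

With a single relevant item, the score depends only on its rank, so a low average NDCG@10 means the target item rarely appears near the top of the list.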

Teach LLMs to plan API call sequences, then generate executable code to answer academic queries faster and more reliably

0.80

0.70

2

SoAy converts complex multi-API academic queries into a short plan plus executable code, cutting latency and improving answer reliability—useful for search services, institutional dashboards, and any product that needs precise, API-backed facts.

Key finding

SoAyLLaMA (Code-13B) achieved the top automated score on SoAyBench.

Numbers: Score 92.74% (Code-13B, Table 3)

A stateful, conversational benchmark that tests LLMs using tools in live multi-turn dialogs

0.60

0.70

0.50

2

ToolSandbox tests realistic, multi-turn tool use and highlights where agents hallucinate, fail to sequence dependent actions, or mishandle time. These are insights you need before putting LLM agents in customer-facing automation.

Key finding

Proprietary models outperform open-source models by a large margin on ToolSandbox tasks.

Numbers: Top scores: GPT-4o 73.0 vs Hermes 31.4 (Table 5)

Use an agent + MCTS to explore a knowledge base, auto-label data, and cut labeled-data needs for KBQA

0.70

0.75

0.50

1

If you must run KBQA with limited labeled data, KBQA-o1 lets you use open LLMs and automated MCTS exploration to generate high-quality training pairs and sharply improve accuracy on complex queries.

Key finding

KBQA-o1 dramatically improves low-resource GrailQA accuracy versus prior methods.

Numbers: GrailQA F1 78.5% (Llama-3.1-8B) vs 48.5% prior low-resource best (GPT-3.5-turbo)

Train agents to internalize human hints so they stop relying on ever-growing prompts

0.70

0.60

0.80

1

You can convert repeated human guidance into model updates that reduce prompt length, cut inference cost, and raise multi-tool task reliability with modest annotation work.

Key finding

After three rounds MNM achieves 97.9% success on ToolQA.

Numbers: 97.9% success (Table 2, Round 3)

Harmonia: an LLM-driven agent that interactively builds reproducible data harmonization pipelines

0.40

0.60

0.50

1

Agentic harmonization can speed up combining heterogeneous datasets and produce reusable, publishable transformation scripts that improve reproducibility and reduce manual engineering time.

Key finding

Harmonia produced perfect schema-matching on the evaluated use case.

Numbers: Schema accuracy Harmonia=1.00 vs Baseline=0.88