Function Calling Papers — Parsed & Scored for Practitioners

ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

0.70

63

If you build assistants that call external services, training on many real APIs plus a retriever and multi-path planning dramatically reduces manual engineering and makes open-source models practically competitive with closed systems.

Key finding

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers: 16,464 APIs; 126,486 instances; 469,585 real API calls
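The multi-path planning idea behind DFSDT (a depth-first search over a decision tree of candidate actions) can be pictured with a toy search that backtracks when a path fails. This is a minimal sketch under stated assumptions, not the paper's algorithm: in ToolBench the candidate steps come from an LLM calling real APIs, whereas here `propose` and `is_solved` are hypothetical stand-ins over integers.

```python
from typing import Callable, List, Optional, Tuple

Proposer = Callable[[int], List[Tuple[str, int]]]


def dfs_plan(state: int, propose: Proposer,
             is_solved: Callable[[int], bool],
             max_depth: int) -> Optional[List[str]]:
    """Depth-first search with backtracking; returns the first action path found."""
    if is_solved(state):
        return []
    if max_depth == 0:
        return None  # dead end: caller backtracks to the next candidate
    for action, next_state in propose(state):
        rest = dfs_plan(next_state, propose, is_solved, max_depth - 1)
        if rest is not None:
            return [action] + rest
    return None


# Hypothetical toy problem: reach 10 from 1 using "double" or "add3".
def propose(state: int) -> List[Tuple[str, int]]:
    return [("double", state * 2), ("add3", state + 3)]


plan = dfs_plan(1, propose, lambda s: s == 10, max_depth=3)
```

The search first tries double, double, double (reaching 8), fails at the depth limit, and backtracks until it finds a path that works. A single-path planner would have stopped at the first failure.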

Generate editable BIM models from plain language by orchestrating LLM agents that write modeling code

0.60

6

Text2BIM lets designers describe early-stage buildings in plain language and get editable BIM models, reducing manual modeling effort and speeding concept-to-BIM workflows while preserving the ability to refine results in standard BIM tools.

Key finding

The framework produced editable IFC/BIM models for 25 diverse prompts, generating 534 models in total.

Numbers: 534 IFC models generated (25 prompts × 3 LLMs × 3 repeats incl. intermediate runs)

xLAM: open-source models (1B–141B) plus a unified function-calling data pipeline that tops the Berkeley Function-Calling leaderboard

0.80

0.60

0.70

4

xLAM provides production-ready, open-source agent models and a reusable data pipeline that reduce dependence on proprietary models for function-calling and tool-heavy workflows, enabling lower-cost deployment and reproducible tool integration.

Key finding

Top overall accuracy on Berkeley Function-Calling Leaderboard v2.

Numbers: 87.31% overall accuracy (xLAM-8x22b-r, BFCL v2 cutoff 09/03/2024)

ClinicalAgent: a GPT-4 multi-agent system that uses external databases to predict clinical trial outcomes

0.40

0.60

0.50

3

ClinicalAgent shows a practical path to combine LLM reasoning and domain databases to flag risky trials and estimate enrollment, speeding early-stage decisions; validate on larger data before clinical use.

Key finding

ClinicalAgent raised precision-recall performance over direct GPT prompting.

Numbers: PR-AUC 0.7908 (+0.3326 vs GPT-4 prompt)

DrugPilot: LLM agent with a key-value memory pool for reliable drug-discovery tool calling

0.60

0.50

3

DrugPilot cuts manual tool switching and context failures by structuring inputs as key-value parameters, improving automation accuracy and runtime for multi-step drug workflows.

Key finding

High task-completion on TCDD tool-calling benchmark.

Numbers: Task completion: 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)
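To illustrate what structuring inputs as key-value parameters might look like, the sketch below has tools read and write named values in a shared pool instead of relying on long free-text context. It is an assumption-laden illustration, not DrugPilot's implementation: `MemoryPool`, `call_tool`, and both toy tools are hypothetical names.

```python
from typing import Any, Callable, Dict, List


class MemoryPool:
    """Shared key-value store for parameters passed between tools (hypothetical)."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._values[key] = value

    def get_many(self, keys: List[str]) -> Dict[str, Any]:
        missing = [k for k in keys if k not in self._values]
        if missing:
            raise KeyError(f"missing parameters: {missing}")
        return {k: self._values[k] for k in keys}


def call_tool(pool: MemoryPool, tool: Callable[..., Any],
              inputs: List[str], output_key: str) -> Any:
    """Fetch named inputs from the pool, run the tool, store the result."""
    result = tool(**pool.get_many(inputs))
    pool.put(output_key, result)
    return result


# Toy stand-ins; this counts letters in a string, not real atoms.
def count_letters(smiles: str) -> int:
    return sum(ch.isalpha() for ch in smiles)


def is_small(letter_count: int, limit: int) -> bool:
    return letter_count <= limit


pool = MemoryPool()
pool.put("smiles", "CCO")
pool.put("limit", 10)
call_tool(pool, count_letters, ["smiles"], "letter_count")
call_tool(pool, is_small, ["letter_count", "limit"], "small")
```

Because each tool declares the keys it needs, a missing parameter fails loudly with a `KeyError` before the tool runs, rather than surfacing later as a malformed call.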

ToolACE: auto-generates 26k verified APIs and complex dialogs to teach LLMs reliable function calling

0.70

0.60

2

ToolACE lets mid-size LLMs (8B) learn practical API use by supplying large, diverse, and verified synthetic tool data—reducing reliance on proprietary APIs and enabling in-house fine-tuned agents for automation tasks.

Key finding

ToolACE builds a very large synthetic API pool.

Numbers: 26,507 APIs across 390 domains

Prune, heal, and quantize a 3.8B SLM to run reliable on-device vehicle function calling at 11 tokens/s

0.70

0.60

0.70

2

You can run a function-calling LLM on existing vehicle CPUs with a small memory footprint and integrate new features faster, cutting cloud costs and latency while improving function-call accuracy over a production speech baseline.

Key finding

You can remove roughly 2 billion parameters from Phi-3 mini while keeping task accuracy for function-calling.

Numbers: Phi3-3.8B → Phi3-1.8B (≈2B removed)

ToolTalk: a small automated benchmark for measuring multi-step tool use in dialogs

0.50

0.60

0.50

2

If you plan to automate user tasks with LLMs, expect frequent multi-step failures and risky incorrect side effects; instrument tool calls and add verification before irreversible actions.

Key finding

Multi-step tool use is still hard: GPT-4 achieves only 50% success on hard conversations.

Numbers: GPT-4 success rate 50% (hard)
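The advice to instrument tool calls and verify before irreversible actions can be sketched as a thin wrapper around tool dispatch. Everything here is a hypothetical illustration: the `IRREVERSIBLE` set, the `guarded_call` helper, and the toy tools are assumptions, not part of ToolTalk.

```python
from typing import Any, Callable, Dict, List

# Assumption: the application decides which tools have side effects.
IRREVERSIBLE = {"send_email", "delete_event"}

call_log: List[Dict[str, Any]] = []


def guarded_call(name: str, args: Dict[str, Any],
                 tools: Dict[str, Callable[..., Any]],
                 confirm: Callable[[str, Dict[str, Any]], bool]) -> Any:
    """Log every tool call and require confirmation for irreversible ones."""
    entry = {"tool": name, "args": args, "status": "pending"}
    call_log.append(entry)
    if name in IRREVERSIBLE and not confirm(name, args):
        entry["status"] = "blocked"
        return None
    result = tools[name](**args)
    entry["status"] = "ok"
    return result


# Toy tools and a confirmation callback that always declines.
tools = {
    "get_weather": lambda city: "sunny",
    "send_email": lambda to, body: f"sent to {to}",
}
deny = lambda name, args: False

weather = guarded_call("get_weather", {"city": "Oslo"}, tools, deny)
email = guarded_call("send_email", {"to": "a@example.com", "body": "hi"}, tools, deny)
```

In a real product the `confirm` callback would prompt the user or apply a policy check. The log gives a per-call audit trail for debugging multi-step failures.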

Practical blueprint for making enterprise APIs 'agent-ready' for autonomous AI agents

0.50

0.60

1

If you plan to let AI agents use your APIs, you must redesign endpoints, headers, and governance now to avoid outages, security gaps, and surprise costs.

Key finding

Traditional REST/GraphQL/gRPC APIs are poorly matched to autonomous, iterative agent behavior.

MultiAPI: a 2,038-prompt, 235-function benchmark that shows LLMs know when to call tools but struggle to pick the right tool and arguments

0.40

0.60

0.35

1

Tool-augmented LLMs can detect when to call external multimodal tools but often select the wrong tool or give bad arguments; validate tool selection and add argument checks before shipping to avoid broken user-facing features.

Key finding

LLMs reliably detect when to call an API.

Numbers: GPT-3.5 invoke accuracy = 99.82% (Table 2)
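Since the weak points are tool selection and arguments rather than the decision to call, a cheap pre-execution check can catch many bad calls. The sketch below assumes a hypothetical registry `TOOL_SPECS` mapping each tool to its required argument names and Python types; the tool names are illustrative and not taken from MultiAPI.

```python
from typing import Any, Dict, List

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SPECS: Dict[str, Dict[str, type]] = {
    "caption_image": {"image_path": str},
    "detect_objects": {"image_path": str, "threshold": float},
}


def check_call(name: str, args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    if name not in TOOL_SPECS:
        return [f"unknown tool: {name}"]
    spec = TOOL_SPECS[name]
    problems = [f"missing argument: {k}" for k in spec if k not in args]
    problems += [f"unexpected argument: {k}" for k in args if k not in spec]
    problems += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in spec.items()
        if k in args and not isinstance(args[k], t)
    ]
    return problems
```

A check like this only validates shape. It cannot tell whether the model picked the right tool for the user's intent, which still needs evaluation against labeled prompts.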

Fine-tune an LLM to parse causal questions, call causal tools, and explain results end-to-end

0.60

0.50

0.60

1

Fine-tuning an open LLM with targeted input-output examples turns it into a reliable causal assistant that extracts task details, runs analysis code, and explains results—reducing time to insight for analysts and lowering dependence on closed APIs.

Key finding

LLM4Causal-Mixed achieved a much higher end-to-end win rate than GPT-4 on synthetic causal tasks.

Numbers: Win rate avg 0.806 vs GPT-4 avg ~0.12 (Table 2)

AnyTool: GPT-4 agent that searches 16k+ APIs via hierarchical retrieval and self-reflection

0.60

0.70

0.50

1

AnyTool automates picking and calling hundreds of real APIs without extra model training, so teams can prototype API-heavy automation faster; expect higher success rates but plan for high token and API-call costs.

Key finding

AnyTool substantially improves real-task pass rates over prior systems.

Numbers: +35.4% average pass rate vs ToolLLM on ToolBench (as reported)

Use tool-calling to distill dialog into search queries and boost medical evidence retrieval.

0.50

0.60

0.50

1

Converting long patient dialogs into crisp search queries with an LLM tool call improves retrieval of official drug information, letting products deliver more evidence-backed medication advice without trusting LLM memory alone.

Key finding

Tool-calling distillation improves coarse document retrieval (HR@1).

Numbers: RagPULSE (7B) document HR@1 = 63.67% vs PULSE (7B) = 53.00% (+10.67 pp).
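For readers unfamiliar with the metric, HR@1 (hit rate at 1) is the fraction of queries whose relevant document is ranked first. The helper below is a minimal sketch that assumes exactly one relevant document per query; the function name and sample data are illustrative, not from the paper.

```python
from typing import List


def hit_rate_at_k(ranked: List[List[str]], relevant: List[str], k: int = 1) -> float:
    """Fraction of queries whose relevant document appears in the top k results."""
    if not relevant:
        raise ValueError("need at least one query")
    hits = sum(1 for docs, gold in zip(ranked, relevant) if gold in docs[:k])
    return hits / len(relevant)


# Four toy queries: ranked retrieval results and the single gold document for each.
ranked = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
relevant = ["a", "d", "e", "x"]

hr1 = hit_rate_at_k(ranked, relevant, k=1)
hr2 = hit_rate_at_k(ranked, relevant, k=2)
```

On this toy data two of four queries hit at rank 1 and three of four within the top 2. The reported gain of 10.67 pp is the plain difference between the two HR@1 percentages, 63.67 and 53.00.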

An open 20B model trained to spot, sequence, and call APIs reliably — ranks 4th on Berkeley's function-calling leaderboard.

0.70

0.60

1

GRANITE-20B-FUNCTIONCALLING is an open, production-ready model for reliable API selection and response synthesis; it lowers risk from calling wrong APIs and offers a license-friendly alternative to closed models.

Key finding

GRANITE-20B-FUNCTIONCALLING ranks 4th on BFCL overall accuracy and is the top open-license model.

Numbers: Overall Acc. 84.71 on BFCL (Table 4)

Design the lakehouse for agents first: solve concurrent runs with branching + isolated functions, and governance follows.

0.60

0.50

0

If you let agents mutate your lakehouse without transactional guarantees and runtime isolation, they can corrupt production data or leak secrets. Building a small, enforceable run API and sandboxed functions reduces risk and makes governance feasible.

Key finding

Multi-node pipelines need atomic commits across tables, not per-table transactions.
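The difference between per-table transactions and an atomic commit across tables can be shown with a toy in-memory catalog: the pipeline runs on a branch, and either every table change is published together or none is. This is a sketch under stated assumptions; `Catalog` and `run_on_branch` are hypothetical, and a real lakehouse would rely on a versioned catalog rather than deep copies.

```python
import copy
from typing import Any, Callable, Dict, List

Tables = Dict[str, List[Dict[str, Any]]]


class Catalog:
    """Toy in-memory catalog standing in for a versioned lakehouse (hypothetical)."""

    def __init__(self, tables: Tables) -> None:
        self.main: Tables = tables

    def run_on_branch(self, pipeline: Callable[[Tables], None]) -> bool:
        """Run a multi-table pipeline on a copy; publish all tables or none."""
        branch = copy.deepcopy(self.main)
        try:
            pipeline(branch)
        except Exception:
            return False      # main is untouched on failure
        self.main = branch    # one swap publishes every table together
        return True


def good_pipeline(t: Tables) -> None:
    t["orders"].append({"id": 1})
    t["totals"].append({"count": 1})


def bad_pipeline(t: Tables) -> None:
    t["orders"].append({"id": 2})
    raise RuntimeError("second node failed")  # totals never gets updated


catalog = Catalog({"orders": [], "totals": []})
ok = catalog.run_on_branch(good_pipeline)
failed = catalog.run_on_branch(bad_pipeline)
```

With per-table transactions, the failed run would have left `orders` updated and `totals` stale. Here the half-finished branch is simply discarded.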

CEDAR: a three-agent system that produces interleaved plan-and-code notebooks to run data science locally

0.60

0.50

0

CEDAR reduces repetitive scripting by automating stepwise DS workflows while keeping data local, speeding prototyping and improving privacy controls for enterprise projects.

Key finding

CEDAR uses three LLM roles: an orchestrator plus separate text and code agents to produce a readable stepwise notebook.

Numbers: 3 agents (orchestrator, text agent, code agent)

A small library that lets LLM-driven agents call off-the-shelf pentest tools (nmap, nuclei, metasploit, curl) via MCP-style RPC.

0.40

0.60

0.50

0

PentestMCP automates routine pentest steps and lets teams swap in better models or updated tools without changing agent code. That can speed internal red-team work and reproduce attacks for testing. However, exploitation reliability depends on the model, and legal safeguards are essential.

Key finding

PentestMCP exposes four core pentest servers (nmap, curl, nuclei, metasploit).

Numbers: 4 servers listed (Table 1–2)

TinyAgent — small on-device LLM agents that call functions and match GPT‑4‑Turbo on tool orchestration

0.70

0.60

0.80

0

TinyAgent shows you can run private, low-latency assistant features on-device with small models that match cloud performance on task-specific API orchestration.

Key finding

Fine-tuning small models on curated function-calling data yields large gains.

Numbers: TinyLlama-1.1B: 12.71% -> 78.89% success (after LoRA fine-tune)

APPL: a Python-native prompt language that auto-parallelizes LLM calls, traces runs, and turns functions into tools

0.70

0.60

0

APPL reduces development time and runtime cost for LLM-driven workflows by making prompts first-class in Python, auto-parallelizing independent calls, and enabling tool integration without manual spec writing.

Key finding

Automatic parallelization significantly reduces wall-clock time for independent LLM calls.

Numbers: CoT-SC (GPT-3.5): 27.6s → 2.9s (9.49× speedup); Table 2
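The speedup comes from issuing independent LLM calls concurrently instead of one after another, as in self-consistency sampling where each sample is independent. The sketch below shows that general idea with Python's standard library only. It does not use APPL's API, and `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling a model.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_llm_call(prompt: str) -> str:
    """Stand-in for a network-bound LLM request (hypothetical)."""
    time.sleep(0.1)
    return f"answer to: {prompt}"


def run_parallel(prompts: List[str]) -> List[str]:
    """Issue independent calls concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        return list(pool.map(fake_llm_call, prompts))


prompts = [f"sample {i}" for i in range(5)]
start = time.perf_counter()
answers = run_parallel(prompts)
elapsed = time.perf_counter() - start  # close to one call's latency, not five
```

Threads suit this case because the work is I/O-bound waiting on a remote service. Calls that depend on each other's output still have to run in sequence.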

Generate validated, machine-readable agent interaction records using only LLMs

0.60

0.70

0

Generates machine-readable agent interaction data at scale without human labeling. This can cut annotation cost, speed agent training cycles, and produce testbeds for function-calling accuracy and multi-turn behavior.

Key finding

The framework is implemented as four modular pipelines covering end-to-end records, DAG-based atomic triples, multi-turn dialogues, and rollout to SFT-ready chat examples.

Numbers: 4 pipelines (RecordSynth, DAGFirstGeneration, MultiTurnDialogueSynth, AgenticRecordRollout)

Multilingual user queries break tool calls when models put non‑English text into parameters.

0.50

0.65

0.35

0

If your product invokes external APIs from an LLM, non-English user inputs can produce non-executable calls even when intent is correct, risking silent failures and poor global UX.

Key finding

Parameter-value language mismatch is the main cause of execution failures when queries are fully translated into a non-English language.
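A crude guard against this failure mode is to flag string arguments containing non-ASCII characters before executing a call whose API expects English values. The helper below is a heuristic sketch, not the paper's method: `non_ascii_params` is a hypothetical name, and an ASCII check misses mismatches in Latin-script languages and wrongly flags legitimate values such as personal names.

```python
from typing import Any, Dict, List


def non_ascii_params(args: Dict[str, Any]) -> List[str]:
    """Return names of string arguments that contain non-ASCII characters."""
    return [
        key for key, value in args.items()
        if isinstance(value, str) and not value.isascii()
    ]


# Toy call where the model copied the user's Japanese text into a parameter.
call = {"city": "東京", "units": "metric", "days": 3}
flagged = non_ascii_params(call)
```

A flagged call could be routed to a normalization step (for example, asking the model to restate the value in the form the API expects) instead of failing silently at execution time.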

Add an explicit 'think' reasoning field to function calls to improve parameter accuracy and explain decisions

0.70

0.60

0.50

0

TAFC improves parameter accuracy and adds explainability for API calls without changing LLMs, reducing silent failures and easing debugging for tool-driven agents in production.

Key finding

TAFC improves Pass Rate across model sizes.

Numbers: Pass Rate +1.6% to +2.5% (varies by model/size)
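One way to add a 'think' field without touching the model is to augment each function's schema before sending it and strip the field before execution. The sketch below assumes a JSON-Schema-style function spec. Placing `think` first in `required` and the exact description text are assumptions made for illustration, not details confirmed by the source.

```python
import copy
from typing import Any, Dict


def add_think_field(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a function spec with an added 'think' string parameter."""
    augmented = copy.deepcopy(schema)
    params = augmented.setdefault("parameters", {"type": "object"})
    props = params.setdefault("properties", {})
    props["think"] = {
        "type": "string",
        "description": "Reasoning for the chosen parameter values.",
    }
    required = params.setdefault("required", [])
    if "think" not in required:
        required.insert(0, "think")
    return augmented


def strip_think(args: Dict[str, Any]) -> Dict[str, Any]:
    """Remove the reasoning field before executing the real function."""
    return {k: v for k, v in args.items() if k != "think"}


schema = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
augmented = add_think_field(schema)
clean_args = strip_think({"think": "User asked about Oslo.", "city": "Oslo"})
```

The stripped `think` text can be logged alongside the call, which is where the explainability and debugging benefit would come from.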

Bundle repeated multi-step tool calls into deterministic 'meta-tools' to cut LLM calls, cost, and failures.

0.70

0.60

0.70

0

AWO provides a low-effort win for production agents: bundle repeated multi-step API sequences into single deterministic calls to cut inference cost and latency by ~5–15% on evaluated workloads while often improving success rates.

Key finding

AWO reduces the number of LLM calls on evaluated benchmarks.

Numbers: LLM calls reduced up to 11.9% (APPWORLD, GPT 5.1)
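Bundling a repeated tool sequence into a meta-tool means the agent makes one call and plain code runs the fixed steps, with no LLM turn in between. The sketch below shows the idea with hypothetical names (`make_meta_tool`, `login`, `fetch_profile`). How AWO discovers which sequences to bundle is not covered here.

```python
from typing import Any, Callable, Dict, List, Tuple

# Each step: (tool function taking the shared state, key to store its output under).
Step = Tuple[Callable[[Dict[str, Any]], Any], str]


def make_meta_tool(steps: List[Step]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Bundle a fixed tool sequence into one deterministic callable."""
    def meta_tool(inputs: Dict[str, Any]) -> Dict[str, Any]:
        state = dict(inputs)
        for tool, output_key in steps:
            state[output_key] = tool(state)
        return state
    return meta_tool


# Toy tools an agent would otherwise call one LLM turn at a time.
def login(state: Dict[str, Any]) -> str:
    return f"token-{state['user']}"


def fetch_profile(state: Dict[str, Any]) -> Dict[str, str]:
    return {"user": state["user"], "auth": state["token"]}


login_and_fetch = make_meta_tool([(login, "token"), (fetch_profile, "profile")])
result = login_and_fetch({"user": "ada"})
```

Two tool invocations that would each have needed a model turn collapse into one call here, which is the source of the saved LLM calls.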

Cut the wasted work: use a big model for intent, a small model for the call, and inject the fixed syntax.

0.75

0.80

0

HyFunc reduces latency and compute for live API-style agents, enabling faster, cheaper, and more responsive assistants without sacrificing accuracy.

Key finding

HyFunc reduces end-to-end inference latency to 0.828 seconds per case.

Numbers: Latency = 0.828s (HyFunc ♣, Table 1)
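The split described in the title can be pictured as three stages: a large model picks the function, code emits the fixed call syntax, and a small model supplies only the parameter values. The sketch below is a loose illustration of that division of labor, not HyFunc's actual mechanism. `FUNCTIONS`, `hybrid_call`, and both model callbacks are hypothetical, and the models are replaced by stubs.

```python
import json
from typing import Callable, Dict, List

# Hypothetical function registry: name -> ordered parameter names.
FUNCTIONS: Dict[str, List[str]] = {
    "get_weather": ["city", "unit"],
    "set_timer": ["minutes"],
}


def hybrid_call(query: str,
                pick_function: Callable[[str], str],
                fill_value: Callable[[str, str, str], str]) -> str:
    """Large model picks the function; code emits syntax; small model fills values."""
    name = pick_function(query)  # stands in for one large-model call
    args = {}
    for param in FUNCTIONS[name]:  # structure comes from the registry, not a model
        args[param] = fill_value(query, name, param)  # small model: values only
    return json.dumps({"name": name, "arguments": args})


# Stubs in place of real models.
pick = lambda query: "get_weather"
fill = lambda query, name, param: {"city": "Paris", "unit": "celsius"}[param]

call_json = hybrid_call("Weather in Paris in celsius?", pick, fill)
```

Because braces, keys, and quoting are generated by code, no model tokens are spent on boilerplate syntax, and the output is always well-formed JSON.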