MultiAPI: a 2,038-prompt, 235-function benchmark that shows LLMs know when to call tools but struggle to pick the right tool and arguments

November 21, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.35

Citation Count

1

Authors

Xiao Liu, Jianfeng Lin, Jiawei Zhang

Why It Matters For Business

Tool-augmented LLMs can detect when to call external multimodal tools, but they often select the wrong tool or supply incorrect arguments. Validate tool selection and add argument checks before shipping to avoid broken user-facing features.

Summary TLDR

This paper builds MultiAPI, a human-refined benchmark of 235 executable multimodal API functions and 2,038 instructions, and tests GPT-3.5 and Llama2 on tool-augmented multimodal tasks. Models almost always decide to call a tool (≈99.8% invoke accuracy for GPT-3.5) but often pick the wrong domain/function and fail to produce correct arguments (function accuracy ≈53%, argument exact-match ≈43% for GPT-3.5). Adding explicit domain descriptions and a secondary argument editor meaningfully improves domain, function, and argument scores.

Problem Statement

LLMs are strong on text, but many real-world problems require multimodal tools. There is no large, executable API benchmark to test whether LLMs can pick the right multimodal tool and produce correct arguments. Without such tests, we cannot reliably integrate LLMs with vision and audio tools in products.

Main Contribution

Released MultiAPI: 235 executable API functions and 2,038 human-refined prompts for multimodal tool evaluation.

Defined a 4-step evaluation (invoke / domain / function / argument) that treats tool use as a text-matching task.
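The four-step evaluation can be sketched as plain field matching between a predicted call and a gold call. This is a minimal illustration, not the paper's scoring code: the `Call` structure, the function names, and the choice to cascade the steps (a later step only counts if the earlier ones passed) are all assumptions made here.

```python
from dataclasses import dataclass, field
from typing import Optional

STEPS = ("invoke", "domain", "function", "argument")


@dataclass
class Call:
    """Hypothetical tool-call record; function is None when no tool is called."""
    domain: Optional[str] = None
    function: Optional[str] = None
    arguments: dict = field(default_factory=dict)


def score_call(pred: Call, gold: Call) -> dict:
    """Score one prediction on the four steps (cascaded: an assumption)."""
    invoke = (pred.function is not None) == (gold.function is not None)
    domain = invoke and pred.domain == gold.domain
    function = domain and pred.function == gold.function
    argument = function and pred.arguments == gold.arguments
    return {"invoke": invoke, "domain": domain,
            "function": function, "argument": argument}


def aggregate(pairs) -> dict:
    """Return per-step accuracy in percent over (pred, gold) pairs."""
    scores = [score_call(pred, gold) for pred, gold in pairs]
    if not scores:
        raise ValueError("need at least one (pred, gold) pair")
    return {step: 100.0 * sum(s[step] for s in scores) / len(scores)
            for step in STEPS}
```

Reporting the four numbers separately, as the benchmark does, shows at which step a system starts to fail instead of folding everything into one end-to-end score.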

Ran experiments on an API-based model (gpt-3.5-turbo-0613) and open Llama models. The results show high invoke success but weak domain, function, and argument selection. The paper also proposes two fixes (domain descriptions and an argument editor) that improve scores.

Key Findings

LLMs reliably detect when to call an API.

Numbers: GPT-3.5 invoke accuracy = 99.82% (Table 2)

Picking the correct domain and specific function is error-prone.

Numbers: GPT-3.5 domain = 71.78%, function = 52.94% (Table 2)

Argument generation is a major bottleneck for successful tool use.

Numbers: Exact argument match = 42.68%; concept similarity ≈ 46.61 (Table 3)

Adding in-context examples reduced domain and function selection accuracy for GPT-3.5.

Numbers: GPT-3.5 vs. GPT-3.5-ict: domain 71.78→68.07; function 52.94→48.35

Simple fixes (domain descriptions + argument correction) improved all metrics.

Numbers: Domain 71.78→76.31, Function 51.73→59.47, Arg 42.68→48.82, Sim 46.61→56.82 (Table 5)
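One way to add domain descriptions is to put a one-line definition next to each domain in the system prompt. The sketch below only shows the idea: the prompt wording, the descriptions, and the function names (`classify_image`, `detect_objects`, `segment_image`) are hypothetical and are not taken from the paper or from MultiAPI.

```python
# Hypothetical one-line descriptions for three domains that are easy to confuse.
DOMAIN_DESCRIPTIONS = {
    "image-classification": "Assign a single label to the whole image.",
    "object-detection": "Locate objects and return a box and label for each.",
    "image-segmentation": "Label every pixel, producing one mask per region.",
}

# Hypothetical function names grouped by domain.
FUNCTIONS_BY_DOMAIN = {
    "image-classification": ["classify_image"],
    "object-detection": ["detect_objects"],
    "image-segmentation": ["segment_image"],
}


def build_system_prompt(domain_descriptions: dict, functions_by_domain: dict) -> str:
    """Build a system prompt that states what each domain is for
    before listing the functions that belong to it."""
    lines = ["You can call the tools below. "
             "Choose the domain first, then a function from that domain."]
    for domain in sorted(domain_descriptions):
        lines.append("")
        lines.append(f"Domain: {domain}")
        lines.append(f"Description: {domain_descriptions[domain]}")
        for function_name in functions_by_domain.get(domain, []):
            lines.append(f"  - {function_name}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt(DOMAIN_DESCRIPTIONS, FUNCTIONS_BY_DOMAIN))
```

Placing the description directly above the function list keeps the distinction between neighbouring domains in view at the point where the model chooses a function.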

Results

Invoke accuracy

Value: GPT-3.5 99.82%

Domain accuracy

Value: GPT-3.5 71.78%

Function accuracy

Value: GPT-3.5 52.94%

Argument exact-match accuracy

Value: GPT-3.5 42.68%

Concept Argument Similarity

Value: GPT-3.5 Sim ≈ 46.61 (cosine)

Improvement with domain prompt + argument correction

Value: Domain 71.78→76.31, Func 51.73→59.47, Arg 42.68→48.82, Sim 46.61→56.82

Baseline: GPT-3.5

Sequential API second-step drop

Value: GPT-3.5-fc function accuracy falls from 46.67% on the first call (Func1) to 40.00% on the second call (Func2)

What To Try In 7 Days

Run MultiAPI or a small subset against your system to measure invoke/domain/function/argument failures

Add concise domain descriptions to the system prompt to reduce domain confusion

Insert a lightweight argument editor (secondary LLM or rules) to validate and correct args before calls, starting with file-path checks and prompt normalization for generators
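The rules-based variant of the argument editor could start as small as the sketch below, which normalizes string arguments and checks file paths before a call is executed. Everything specific in it is an assumption for illustration: the `_path` naming convention, the allowed extensions, and the `known_files` set standing in for whatever storage the system really uses.

```python
import os
import re

# Assumption: the tools accept only these image types.
ALLOWED_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def edit_arguments(args: dict, known_files: set) -> tuple:
    """Return (cleaned_args, problems) for one predicted tool call.

    String values get whitespace collapsed. Arguments whose name ends in
    "_path" (a hypothetical convention) are also stripped of stray quotes
    and checked against the allowed extensions and the known files.
    """
    cleaned, problems = {}, []
    for name, value in args.items():
        if isinstance(value, str):
            # Prompt normalization: collapse runs of whitespace and newlines.
            value = re.sub(r"\s+", " ", value).strip()
        if name.endswith("_path"):
            if not isinstance(value, str):
                problems.append(f"{name}: expected a string path")
            else:
                value = value.strip("'\" ")
                extension = os.path.splitext(value)[1].lower()
                if extension not in ALLOWED_IMAGE_EXTENSIONS:
                    problems.append(f"{name}: unsupported extension in {value!r}")
                elif value not in known_files:
                    problems.append(f"{name}: {value!r} is not an available file")
        cleaned[name] = value
    return cleaned, problems
```

A non-empty `problems` list can either block the call or be passed to a secondary LLM as a correction hint. Which of the two is appropriate depends on the product, and the source does not prescribe one.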

Agent Features

Tool Use

  • Function Calling

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Context-window limits force the dataset to be split; only ~25 functions are tested per split.
  • Experiments use GPT-3.5 and Llama2 only; results may differ for other models.
  • Human refinement and argument standardization introduce curator bias.
  • Evaluation uses function call matching, not downstream output correctness.

When Not To Use

  • Do not assume function-call accuracy implies correct downstream outputs; validate end-to-end results.
  • Avoid using benchmark results as final QA for safety-sensitive systems without human checks.

Failure Modes

  • Confusion among the image classification, segmentation, and detection domains, leading to wrong function calls.
  • Incorrect or malformed exact-match arguments (file paths) causing failed executions.
  • Sequential API calls degrade on later steps, increasing error propagation.

Core Entities

Models

  • gpt-3.5-turbo-0613
  • Llama2-13B

Metrics

  • Accuracy
  • ROUGE-1/2/L
  • Cosine Similarity (argument embeddings)
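For reference, cosine similarity between two argument embeddings can be computed as in the sketch below. The source does not say which embedding model produces the vectors, so the inputs here are plain lists of floats. The reported values (for example, Sim ≈ 46.61) appear to be on a 0 to 100 scale, which would correspond to multiplying this result by 100; that reading is an assumption.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero length, a convention
    chosen here to avoid dividing by zero.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Unlike exact match, this score gives partial credit when a generated argument is close in meaning to the reference, which suits free-text arguments such as generation prompts.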

Datasets

  • MultiAPI (this paper, 235 functions, 2,038 prompts)
  • HuggingFace instruction-code dataset (Patil et al., 2023)

Benchmarks

  • MultiAPI
  • MultiAPI-SEQ (sequential 2-step API calls)