MultiAPI: a 2,038-prompt, 235-function benchmark that shows LLMs know when to call tools but struggle to pick the right tool and arguments

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and the experiments clearly show specific failure modes, but results are limited to GPT-3.5 and Llama2 and to the benchmark's own function set.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 40%

Novelty: 60%

Authors

Xiao Liu, Jianfeng Lin, Jiawei Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Tool-augmented LLMs can detect when to call external multimodal tools but often select the wrong tool or give bad arguments; validate tool selection and add argument checks before shipping to avoid broken user-facing features.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper builds MultiAPI, a human-refined benchmark of 235 executable multimodal API functions and 2,038 instructions, and tests GPT-3.5 and Llama2 on tool-augmented multimodal tasks. Models almost always decide to call a tool (≈99.8% invoke accuracy for GPT-3.5) but often pick the wrong domain/function and fail to produce correct arguments (function accuracy ≈53%, argument exact-match ≈43% for GPT-3.5). Adding explicit domain descriptions and a secondary argument editor meaningfully improves domain, function, and argument scores.

Problem Statement

LLMs are strong on text but real-world problems need multimodal tools. There is no large, executable API benchmark to test whether LLMs can pick the right multimodal tool and produce correct arguments. Without such tests we can't reliably integrate LLMs with vision/audio tools in products.

Main Contribution

Released MultiAPI: 235 executable API functions and 2,038 human-refined prompts for multimodal tool evaluation.

Defined a 4-step evaluation (invoke / domain / function / argument) that treats tool use as a text-matching task.
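The four levels can be scored as plain text matches. The following is a minimal sketch, not the paper's evaluation code: the `Call` record, its field names, and the rule that a level only counts when the previous levels matched are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """Hypothetical record for one predicted or reference tool call."""
    invoke: bool                      # did the model decide to call a tool?
    domain: str = ""                  # e.g. "image-classification"
    function: str = ""                # e.g. "classify_image"
    arguments: dict = field(default_factory=dict)


def score(pred: Call, gold: Call) -> dict:
    """Return a 0/1 score per level.

    Assumption of this sketch: scoring cascades, so a level can only
    score 1 if every earlier level also matched.
    """
    invoke_ok = pred.invoke == gold.invoke
    domain_ok = invoke_ok and pred.domain == gold.domain
    function_ok = domain_ok and pred.function == gold.function
    args_ok = function_ok and pred.arguments == gold.arguments
    return {
        "invoke": int(invoke_ok),
        "domain": int(domain_ok),
        "function": int(function_ok),
        "argument": int(args_ok),
    }


def accuracy(pairs: list[tuple[Call, Call]]) -> dict:
    """Average each level's score over (prediction, reference) pairs."""
    totals = {"invoke": 0, "domain": 0, "function": 0, "argument": 0}
    for pred, gold in pairs:
        for level, value in score(pred, gold).items():
            totals[level] += value
    return {level: total / len(pairs) for level, total in totals.items()}
```

Reporting the four numbers side by side makes it easy to see where a pipeline loses accuracy, which is the pattern the benchmark highlights: near-perfect invoke scores with a steep drop at the function and argument levels.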

Key Findings

LLMs reliably detect when to call an API.

Numbers: GPT-3.5 invoke accuracy = 99.82% (Table 2)

Practical Use: You can trust GPT-3.5 to know when a user needs an external tool; focus engineering effort on which tool and arguments to pick.

Evidence Ref: Table 2

Picking the correct domain and specific function is error-prone.

Numbers: GPT-3.5 domain=71.78%, function=52.94% (Table 2)

Practical Use: Add domain guidance or disambiguation steps before invocation; expect roughly half of function selections to be wrong without fixes.

Evidence Ref: Table 2
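One way to add domain guidance is to render short domain descriptions into the system prompt. The sketch below is illustrative only: the domain names, the description text, and the `build_system_prompt` helper are assumptions, not the prompts used in the paper.

```python
# Hypothetical domain descriptions; write your own to match your tool set.
DOMAIN_DESCRIPTIONS = {
    "image-classification": "Assign one label to the whole image.",
    "object-detection": "Locate objects and return bounding boxes with labels.",
    "image-segmentation": "Return a per-pixel mask for each region or object.",
}


def build_system_prompt(domains: dict[str, str]) -> str:
    """Render domain descriptions as a short block for the system prompt."""
    lines = [
        "You can call tools from the following domains.",
        "Pick the domain first, then a function inside it.",
    ]
    # Sort by name so the prompt is stable across runs.
    for name, description in sorted(domains.items()):
        lines.append(f"- {name}: {description}")
    return "\n".join(lines)
```

Calling `build_system_prompt(DOMAIN_DESCRIPTIONS)` returns a five-line string that can be prepended to the function documentation. Keeping each description to one sentence that says what the domain returns helps separate domains that are easy to confuse, such as classification, detection, and segmentation.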

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Invoke accuracy	GPT-3.5 99.82%	n/a	n/a	MultiAPI (Table 2)	High invoke accuracy shows models detect need for tools	Table 2
Domain accuracy	GPT-3.5 71.78%	n/a	n/a	MultiAPI (Table 2)	Models confused among image analysis domains	Table 2; Figure 2

What To Try In 7 Days

Run MultiAPI or a small subset against your system to measure invoke/domain/function/argument failures

Add concise domain descriptions to the system prompt to reduce domain confusion

Insert a lightweight argument editor (secondary LLM or rules) to validate and correct args before calls, starting with file-path checks and prompt normalization for generators
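A rule-based argument editor can start very small. The sketch below covers only the two checks named above (file paths and prompt normalization); the argument keys `image_path` and `prompt`, and both helper names, are hypothetical and should be mapped to your own function signatures.

```python
import os
import re


def normalize_prompt(text: str) -> str:
    """Collapse whitespace and strip one pair of wrapping quotes."""
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    return text


def check_arguments(args: dict, path_keys=("image_path",), prompt_keys=("prompt",),
                    path_exists=os.path.exists) -> tuple[dict, list[str]]:
    """Return (cleaned_args, problems); an empty problem list means the call may proceed.

    The key names are assumptions for illustration. `path_exists` is
    injectable so the check can be tested without touching the disk.
    """
    cleaned = dict(args)
    problems = []
    for key in path_keys:
        if key in cleaned:
            path = str(cleaned[key]).strip()
            cleaned[key] = path
            if not path_exists(path):
                problems.append(f"{key}: file not found: {path}")
    for key in prompt_keys:
        if key in cleaned:
            cleaned[key] = normalize_prompt(str(cleaned[key]))
            if not cleaned[key]:
                problems.append(f"{key}: empty prompt")
    return cleaned, problems
```

If the problem list is non-empty, the caller can either reject the call or hand the original arguments and the problem messages to a secondary LLM for correction.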

Agent Features

Tool Use

Function Calling

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/HaroldLiuJ/MultiAPI

Data URLs

https://github.com/HaroldLiuJ/MultiAPI

Risks & Boundaries

Limitations

Context window forces dataset splits; only ~25 functions per split were tested.

Experiments use GPT-3.5 and Llama2 only; results may differ for other models.
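To reproduce the split constraint on your own function set, the documentation can be chunked so each split fits the context window. This is a minimal sketch: sequential chunking and the `split_functions` helper are assumptions, since how the authors grouped functions into splits is not stated here.

```python
def split_functions(functions: list[str], per_split: int = 25) -> list[list[str]]:
    """Chunk the function list into consecutive groups of at most `per_split`."""
    if per_split <= 0:
        raise ValueError("per_split must be positive")
    return [functions[i:i + per_split] for i in range(0, len(functions), per_split)]
```

With 235 functions and 25 per split, this yields ten splits, the last holding the remaining 10 functions. Scores measured per split may overstate accuracy relative to a setting where the model must choose among all functions at once.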

When Not To Use

Do not assume function-call accuracy implies correct downstream outputs; validate end-to-end results.

Avoid using benchmark results as final QA for safety-sensitive systems without human checks.

Failure Modes

Confusing the image classification, segmentation, and detection domains, leading to wrong function calls.

Incorrect or malformed arguments that must match exactly (such as file paths), causing failed executions.

Core Entities

Models

gpt-3.5-turbo-0613, Llama2-13B

Metrics

Accuracy, ROUGE-1/2/L, Cosine Similarity (argument embeddings)
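For free-text arguments, overlap and similarity metrics give partial credit where exact match would not. The sketch below shows a simplified ROUGE-1 F1 and a cosine similarity in plain Python; whitespace tokenization and the helper names are assumptions, and the paper's exact tokenizer and embedding model are not specified here.

```python
import math
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

The vectors passed to `cosine_similarity` would come from whatever embedding model you use for arguments; the function itself makes no assumption about their source.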

Datasets

MultiAPI (this paper, 235 functions, 2,038 prompts); HuggingFace instruction-code dataset (Patil et al., 2023)

Benchmarks

MultiAPI, MultiAPI-SEQ (sequential 2-step API calls)
