Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.35
Citation Count
1
Why It Matters For Business
Tool-augmented LLMs reliably detect when to call an external multimodal tool but often select the wrong tool or pass incorrect arguments. Validate tool selection and add argument checks before shipping to avoid broken user-facing features.
Summary TLDR
This paper builds MultiAPI, a human-refined benchmark of 235 executable multimodal API functions and 2,038 instructions, and tests GPT-3.5 and Llama2 on tool-augmented multimodal tasks. Models almost always decide to call a tool (≈99.8% invoke accuracy for GPT-3.5) but often pick the wrong domain/function and fail to produce correct arguments (function accuracy ≈53%, argument exact-match ≈43% for GPT-3.5). Adding explicit domain descriptions and a secondary argument editor meaningfully improves domain, function, and argument scores.
Problem Statement
LLMs are strong on text, but real-world problems need multimodal tools. There is no large, executable API benchmark for testing whether LLMs can pick the right multimodal tool and produce correct arguments. Without such tests, LLMs cannot be reliably integrated with vision and audio tools in products.
Main Contribution
Released MultiAPI: 235 executable API functions and 2,038 human-refined prompts for multimodal tool evaluation.
Defined a 4-step evaluation (invoke / domain / function / argument) that treats tool use as a text-matching task.
Ran experiments on an API-based model (gpt-3.5-turbo-0613) and open Llama2 models; found high invoke success but weak domain, function, and argument selection; and proposed two fixes (domain descriptions and an argument editor) that improve scores.
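The four-step evaluation (invoke / domain / function / argument) can be sketched as per-step matching between a predicted call and a reference call. This is a minimal illustration, not the benchmark's own scorer: the `ToolCall` fields, the function names, and the use of strict exact match at every step are assumptions made for the example (the metrics listed for this paper also include ROUGE and embedding similarity for some arguments).

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A predicted or reference tool call (field names are illustrative)."""
    invoke: bool                      # did the model decide to call a tool?
    domain: str = ""                  # e.g. "image-classification"
    function: str = ""                # name of the specific API function
    arguments: dict = field(default_factory=dict)


def score_call(pred: ToolCall, gold: ToolCall) -> dict:
    """Return a 0/1 exact-match score for each of the four steps."""
    return {
        "invoke": int(pred.invoke == gold.invoke),
        "domain": int(pred.domain == gold.domain),
        "function": int(pred.function == gold.function),
        "argument": int(pred.arguments == gold.arguments),
    }


def aggregate(pairs: list[tuple[ToolCall, ToolCall]]) -> dict:
    """Average each step's score over (prediction, reference) pairs."""
    totals = {"invoke": 0, "domain": 0, "function": 0, "argument": 0}
    for pred, gold in pairs:
        for step, value in score_call(pred, gold).items():
            totals[step] += value
    return {step: total / len(pairs) for step, total in totals.items()}
```

Reporting the four steps separately, rather than as one pass/fail number, is what makes it possible to see that invocation succeeds while function and argument selection fail.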
Key Findings
LLMs reliably detect when to call an API.
Picking the correct domain and specific function is error-prone.
Argument generation is a major bottleneck for successful tool use.
Adding in-context examples often reduced function-selection performance.
Simple fixes (domain descriptions + argument correction) improved all metrics.
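One way the domain-description fix might look in practice is a short block of per-domain descriptions rendered into the system prompt. The sketch below is hypothetical: the domain names, the description wording, and `build_system_prompt` are illustrative and are not taken from the benchmark.

```python
# Hypothetical one-line descriptions; wording is illustrative, not from the benchmark.
DOMAIN_DESCRIPTIONS = {
    "image-classification": "Assign a single label to the whole image.",
    "object-detection": "Locate objects and return labelled bounding boxes.",
    "image-segmentation": "Label every pixel with a class or instance mask.",
}


def build_system_prompt(descriptions: dict[str, str]) -> str:
    """Render domain descriptions as a bulleted block for a system prompt."""
    lines = ["Choose exactly one domain before selecting a function.", "Domains:"]
    # Sort by name so the prompt is stable across runs.
    for name, text in sorted(descriptions.items()):
        lines.append(f"- {name}: {text}")
    return "\n".join(lines)
```

Keeping each description to one contrastive sentence is a design choice aimed at the domains that are easiest to confuse with one another.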
Results
Accuracy
Concept Argument Similarity
Improvement with domain prompt + argument correction
Sequential API second-step drop
Who Should Care
What To Try In 7 Days
Run MultiAPI or a small subset against your system to measure invoke/domain/function/argument failures
Add concise domain descriptions to the system prompt to reduce domain confusion
Insert a lightweight argument editor (secondary LLM or rules) to validate and correct args before calls, starting with file-path checks and prompt normalization for generators
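A rules-based argument editor of the kind suggested above could start as small as the following sketch. Everything here is an assumption made for illustration: the argument names in `PATH_KEYS` and `PROMPT_KEYS`, the `known_files` set, and the repair strategy (match on file name, collapse whitespace) are hypothetical and are not the paper's argument editor.

```python
import os
import re

# Hypothetical rule set: which argument names hold file paths or free-text prompts.
PATH_KEYS = {"image_path", "audio_path", "file_path"}
PROMPT_KEYS = {"prompt"}


def edit_arguments(args: dict, known_files: set[str]) -> tuple[dict, list[str]]:
    """Return (corrected arguments, list of unresolved problems)."""
    fixed, problems = dict(args), []
    for key, value in args.items():
        if key in PATH_KEYS:
            # Strip whitespace and quotes the model may have wrapped around the path.
            path = str(value).strip().strip("'\"")
            if path not in known_files:
                # Fall back to matching on the file name alone.
                matches = [f for f in known_files
                           if os.path.basename(f) == os.path.basename(path)]
                if len(matches) == 1:
                    path = matches[0]
                else:
                    problems.append(f"{key}: unknown file {path!r}")
            fixed[key] = path
        elif key in PROMPT_KEYS:
            # Normalize generator prompts by collapsing runs of whitespace.
            fixed[key] = re.sub(r"\s+", " ", str(value)).strip()
    return fixed, problems
```

Returning the unresolved problems alongside the corrected arguments lets the caller decide whether to block the call, retry with a secondary LLM, or ask the user.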
Agent Features
Tool Use
- Function Calling
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Context window forces dataset splits; only ~25 functions per split were tested.
- Experiments use GPT-3.5 and Llama2 only; results may differ for other models.
- Human refinement and argument standardization introduce curator bias.
- Evaluation uses function call matching, not downstream output correctness.
When Not To Use
- Do not assume function-call accuracy implies correct downstream outputs; validate end-to-end results.
- Avoid using benchmark results as final QA for safety-sensitive systems without human checks.
Failure Modes
- Confusion among the image classification, segmentation, and detection domains, leading to wrong function calls.
- Incorrect or malformed exact-match arguments (file paths), causing failed executions.
- Accuracy on sequential API calls drops at later steps, so early errors propagate.
Core Entities
Models
- gpt-3.5-turbo-0613
- Llama2-13B
Metrics
- Accuracy
- ROUGE-1/2/L
- Cosine Similarity (argument embeddings)
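For reference, cosine similarity between two argument vectors can be computed as in the sketch below. The `toy_embed` function is a stand-in that uses lowercase token counts. It is not the embedding model used in the paper, which this summary does not name.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase token counts (NOT the paper's embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Unlike exact match, a similarity score gives partial credit when a free-text argument is close to the reference but not identical.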
Datasets
- MultiAPI (this paper, 235 functions, 2,038 prompts)
- HuggingFace instruction-code dataset (Patil et al., 2023)
Benchmarks
- MultiAPI
- MultiAPI-SEQ (sequential 2-step API calls)

