Ask-when-Needed (AwN): make LLM agents ask clarifying questions before calling APIs

August 31, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Links

Abstract / PDF

Why It Matters For Business

Agents that blindly fill missing arguments can call wrong APIs and produce unsafe or useless results; prompting them to ask reduces errors and improves end-user outcomes.

Summary TLDR

Real user prompts are often ambiguous and cause LLM agents to hallucinate arguments and call wrong APIs. The authors analyze real errors, build NoisyToolBench (ambiguous-instruction benchmark), propose Ask-when-Needed (AwN) prompting to make agents ask clarifying questions, and build ToolEvaluator to auto-run and judge interactions. Across six LLMs and two frameworks, AwN raises the rate of asking the right clarification and improves correct API calls and final answers, while adding a modest number of extra steps in most cases.

Problem Statement

LLM agents that call external APIs fail when user instructions omit required details, state them incorrectly, or ask for things beyond the available tools. Because LLMs often predict missing arguments instead of asking, they hallucinate and mis-execute APIs. The paper studies real instruction errors, measures agent behavior under noisy instructions, and seeks a practical prompting fix.

Main Contribution

Systematic analysis of real-world problematic user instructions and a four-way taxonomy of common errors.

NoisyToolBench: a benchmark that injects realistic ambiguous, incorrect, and out-of-capability user queries for API-based agents.

Ask-when-Needed (AwN): a prompting framework that instructs agents to check API requirements and ask clarifying questions before calling functions.

ToolEvaluator: an automated pipeline that proxies users and uses semantic similarity plus GPT-4o judging to evaluate accuracy and efficiency.
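The check-before-call decision behind AwN can be illustrated with a minimal sketch. AwN itself is a prompting framework, so the deterministic check below is only an analogy for the behavior the prompt asks the model to follow. `ToolSpec`, `missing_arguments`, `decide`, and the `get_weather` tool are hypothetical names invented for this example, not part of the paper's code.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of one callable API (not from the paper)."""
    name: str
    required: list[str]
    optional: list[str] = field(default_factory=list)


def missing_arguments(spec: ToolSpec, args: dict) -> list[str]:
    """Return the required parameters that are absent or empty in args."""
    return [p for p in spec.required if args.get(p) in (None, "")]


def decide(spec: ToolSpec, args: dict) -> dict:
    """Ask when a required argument is missing; otherwise allow the call."""
    missing = missing_arguments(spec, args)
    if missing:
        question = (
            f"To call {spec.name}, I still need: {', '.join(missing)}. "
            "Could you provide them?"
        )
        return {"action": "ask", "missing": missing, "question": question}
    return {"action": "call", "tool": spec.name, "arguments": args}


# Illustrative tool and inputs (assumptions for this sketch).
weather = ToolSpec(name="get_weather", required=["city", "date"], optional=["units"])
print(decide(weather, {"city": "Hong Kong"})["action"])
print(decide(weather, {"city": "Hong Kong", "date": "2024-08-31"})["action"])
```

This sketch only covers missing arguments. Instructions that contain wrong values or requests beyond the tool's capabilities would need a different check, which is one reason the paper relies on the model's own judgment rather than a fixed rule.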

Key Findings

Most user instruction errors omit required details.

Numbers: IMKI = 56.0%

Automated judging via ToolEvaluator closely matches humans on sampled cases.

Numbers: ToolEvaluator accuracy = 90%

AwN substantially increases how often agents ask the right clarifying question (A1).

Numbers: e.g. gpt-4o with CoT, A1: 0.52 -> 0.90

AwN improves correct API calls (A2) and final answers (A3) in experiments.

Numbers: e.g. A2: 0.48 -> 0.58; A3 improvements reported across models
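To put the example figures in perspective, the short calculation below converts the reported before/after rates into absolute and relative gains. It uses only the two example pairs quoted above; the helper name `gains` is an assumption made for this sketch.

```python
def gains(baseline: float, with_awn: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) between two rates."""
    absolute = with_awn - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 1)


print(gains(0.52, 0.90))  # A1 example: (0.38, 73.1)
print(gains(0.48, 0.58))  # A2 example: (0.1, 20.8)
```

On these examples the gain in asking the right question (A1) is much larger than the gain in correct API calls (A2), which is consistent with the limitation that asking does not guarantee a correct call afterwards.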

Results

A1 (ask expected clarifying questions)

Value: up to 0.90

Baseline: e.g. 0.52 (CoT, gpt-4o)

A2 (invoke correct API with correct args)

Value: e.g. improved to 0.58

Baseline: e.g. 0.48

A3 (final answer aligns with intent)

Value: varied across models, increased in many cases

Baseline: see Table 2 for per-model values

ToolEvaluator agreement with human judges

Value: 0.90 accuracy

Instruction error distribution (four categories)

Value: IMKI 56.0%, IMR 11.3%, IwE 17.3%, IBTC 15.3%

Who Should Care

What To Try In 7 Days

Run a small experiment: add AwN-style pre-call checks to your agent and compare A1/A2/A3 on a sample of ambiguous prompts.

Use ToolEvaluator or a simple semantic-similarity judge to automate evaluation and spot failure modes quickly.

Add a lightweight human audit of a 50-case sample to validate automated judgments before rollout.
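A first version of such an experiment can be scored with a few lines of code. The sketch below is a simplified stand-in, not ToolEvaluator: it uses word-overlap (Jaccard) similarity instead of a semantic-similarity model or a GPT-4o judge, and the record fields (`asked`, `expected_question`, `api_call`, `expected_call`, `answer_ok`) and the 0.5 threshold are assumptions chosen for illustration. The A1/A2/A3 definitions here are simplified readings of the metric descriptions above.

```python
import re


def token_jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a crude proxy for semantic similarity."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def asked_expected_question(asked: list[str], expected: str, threshold: float = 0.5) -> bool:
    """True if any question the agent asked is close enough to the expected one."""
    return any(token_jaccard(q, expected) >= threshold for q in asked)


def score_runs(runs: list[dict]) -> dict:
    """Aggregate per-run judgments into A1, A2, A3 rates."""
    n = len(runs)
    if n == 0:
        return {"A1": 0.0, "A2": 0.0, "A3": 0.0}
    a1 = sum(asked_expected_question(r["asked"], r["expected_question"]) for r in runs)
    a2 = sum(r["api_call"] == r["expected_call"] for r in runs)
    a3 = sum(bool(r["answer_ok"]) for r in runs)  # answer_ok comes from a human or LLM judge
    return {"A1": a1 / n, "A2": a2 / n, "A3": a3 / n}


# Two illustrative runs (made-up data for the sketch).
runs = [
    {
        "asked": ["Which date do you want the weather for?"],
        "expected_question": "What date do you want the weather for?",
        "api_call": ("get_weather", {"city": "Paris", "date": "2024-09-01"}),
        "expected_call": ("get_weather", {"city": "Paris", "date": "2024-09-01"}),
        "answer_ok": True,
    },
    {
        "asked": [],
        "expected_question": "Which city are you asking about?",
        "api_call": ("get_weather", {"city": "London", "date": "2024-09-01"}),
        "expected_call": ("get_weather", {"city": "Paris", "date": "2024-09-01"}),
        "answer_ok": False,
    },
]
print(score_runs(runs))
```

Running the same script on a baseline agent and on an AwN-prompted agent gives a side-by-side comparison. Word overlap will miss paraphrases, so the human audit remains important for calibrating the threshold.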

Agent Features

Planning

  • Ask-when-Needed decision: check before call
  • Chain-of-Thought (CoT) integration

Tool Use

  • function calling
  • multi-step API sequences
  • clarifying-question loop
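The clarifying-question loop can be sketched as follows, with a dictionary standing in for the user (similar in spirit to proxying users during evaluation). The function name `run_with_clarification`, the `max_questions` budget, and the one-question-per-parameter policy are assumptions made for this example, not details from the paper.

```python
def run_with_clarification(required, args, user_answers, max_questions=3):
    """Ask for missing required arguments, one per round, within a question budget."""
    args = dict(args)  # copy so the caller's dict is not modified
    transcript = []
    for _ in range(max_questions):
        missing = [p for p in required if p not in args]
        if not missing:
            break
        param = missing[0]
        transcript.append(f"agent: What is the {param}?")
        if param not in user_answers:
            # The simulated user cannot answer, so stop instead of guessing.
            transcript.append("user: I don't know.")
            return {"status": "abort", "args": args, "transcript": transcript}
        args[param] = user_answers[param]
        transcript.append(f"user: {user_answers[param]}")
    still_missing = [p for p in required if p not in args]
    status = "call" if not still_missing else "abort"
    return {"status": status, "args": args, "transcript": transcript}


result = run_with_clarification(
    required=["city", "date"],
    args={"city": "Paris"},
    user_answers={"date": "2024-09-01"},
)
print(result["status"])
print("\n".join(result["transcript"]))
```

Capping the number of questions is one simple way to bound the extra steps that clarification adds. When the budget runs out, the sketch aborts rather than filling in a guessed value.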

Frameworks

  • AwN
  • CoT
  • ReAct
  • DFSDT

Is Agentic

true

Architectures

  • LLM + API function calling

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • AwN improves but does not fully close the performance gap; many cases still fail.
  • ToolEvaluator is not perfect; automated judgments produce some false positives and negatives.

When Not To Use

  • When all user instructions are guaranteed complete and precise
  • When strict low latency is required and extra clarification rounds are unacceptable

Failure Modes

  • Agent asks many redundant or irrelevant questions, increasing latency and cost
  • Automated judge mislabels correct/incorrect behavior (ToolEvaluator errors)
  • Agent still hallucinates or chooses wrong API despite asking

Core Entities

Models

  • gpt-3.5-turbo-0125
  • gpt-4-turbo-2024-04-09
  • gpt-4o-2024-11-20
  • deepseek-v3
  • gemini-1.5-flash-latest
  • claude-3-5-haiku-20241022

Metrics

  • A1
  • A2
  • A3
  • Re
  • Steps

Datasets

  • NoisyToolBench
  • ToolBench

Benchmarks

  • NoisyToolBench
  • ToolBench