Ask-when-Needed (AwN): make LLM agents ask clarifying questions before calling APIs

Overview

Decision Snapshot: Needs Validation

AwN is easy to apply as a prompting change and shows consistent accuracy gains across models, but automated judging and per-model cost trade-offs need extra validation before production.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Why It Matters For Business

Agents that blindly fill missing arguments can call wrong APIs and produce unsafe or useless results; prompting them to ask reduces errors and improves end-user outcomes.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist, CTO

Summary TLDR

Real user prompts are often ambiguous and cause LLM agents to hallucinate arguments and call wrong APIs. The authors analyze real errors, build NoisyToolBench (ambiguous-instruction benchmark), propose Ask-when-Needed (AwN) prompting to make agents ask clarifying questions, and build ToolEvaluator to auto-run and judge interactions. Across six LLMs and two frameworks, AwN raises the rate of asking the right clarification and improves correct API calls and final answers, while adding a modest number of extra steps in most cases.

Problem Statement

LLM agents that call external APIs fail when user instructions omit, misstate, or ask for things beyond available tools. Because LLMs often predict missing arguments instead of asking, they hallucinate and mis-execute APIs. The paper studies real instruction errors, measures agent behavior under noisy instructions, and seeks a practical prompting fix.
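The failure described above (guessing a missing argument instead of asking) can also be guarded against in code, before any API call is issued. The following is a minimal sketch under stated assumptions: `ToolSpec`, `decide_next_action`, and the `book_flight` tool are hypothetical names invented for illustration, and AwN itself is a prompting change, so this mirrors the ask-or-call decision rather than reproducing the authors' method.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of one callable API (illustrative only)."""
    name: str
    required: list[str] = field(default_factory=list)


def missing_arguments(tool: ToolSpec, args: dict) -> list[str]:
    """Return required parameters the user has not actually supplied."""
    return [p for p in tool.required if args.get(p) in (None, "")]


def decide_next_action(tool: ToolSpec, args: dict) -> dict:
    """Ask when anything required is missing; otherwise allow the call."""
    missing = missing_arguments(tool, args)
    if missing:
        question = f"To use {tool.name}, I still need: {', '.join(missing)}."
        return {"action": "ask", "question": question, "missing": missing}
    return {"action": "call", "tool": tool.name, "args": args}


# Hypothetical tool and user request, for illustration only.
book_flight = ToolSpec("book_flight", required=["origin", "destination", "date"])
print(decide_next_action(book_flight, {"origin": "HKG", "destination": "SFO"}))
```

A check like this only catches omitted arguments. Instructions that are wrong or beyond the available tools would still need the model itself to notice the problem.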

Main Contribution

Systematic analysis of real-world problematic user instructions and a four-way taxonomy of common errors.

NoisyToolBench: a benchmark that injects realistic ambiguous, incorrect, and out-of-capability user queries for API-based agents.

Key Findings

Most user instruction errors omit required details.

Numbers: IMKI = 56.0%

Practical Use: Expect over half of problematic user instructions to omit critical arguments; add clarification steps or validation before API calls.

Evidence Ref: Table 1, Section 3.1

Automated judging via ToolEvaluator closely matches humans on sampled cases.

Numbers: ToolEvaluator accuracy = 90%

Practical Use: You can automate large-scale agent evaluation but keep a small human audit due to remaining error risk.

Evidence Ref: Section 5.3
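One way to run the small human audit is to measure how often the automated judge agrees with human labels on a sample. This is a rough sketch, not ToolEvaluator's own procedure: the function names are hypothetical and the ten labels below are made up for illustration.

```python
def judge_agreement(auto_labels, human_labels):
    """Fraction of cases where the automated judge matches the human label."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


def disagreement_indices(auto_labels, human_labels):
    """Positions worth re-reading by hand because the two labels differ."""
    return [i for i, (a, h) in enumerate(zip(auto_labels, human_labels)) if a != h]


# Made-up audit sample: True means "judged correct".
auto = [True, True, False, True, False, True, True, False, True, True]
human = [True, True, False, True, True, True, True, False, True, True]

print(judge_agreement(auto, human))
print(disagreement_indices(auto, human))
```

Reviewing the disagreeing cases by hand shows whether the judge tends toward false positives or false negatives, which a single agreement number hides.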

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
A1 (ask expected clarifying questions)	up to 0.90	examples: 0.52 (CoT, gpt-4o)	+0.38 (example)	NoisyToolBench	Table 2; 'Main Result' paragraph	Table 2
A2 (invoke correct API with correct args)	example improved to 0.58	example baseline 0.48	+0.10 (example)	NoisyToolBench	Table 2; 'Main Result' paragraph	Table 2
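The deltas in the table are plain differences between the reported value and its baseline. The short sketch below reproduces them from the two rows above and adds a relative gain, which is a derived figure not reported in the source; the helper names are hypothetical.

```python
def delta(value, baseline):
    """Absolute improvement, rounded to the table's two decimals."""
    return round(value - baseline, 2)


def relative_gain_percent(value, baseline):
    """Improvement as a percentage of the baseline (derived, not from the paper)."""
    return round((value - baseline) / baseline * 100, 1)


# (value, baseline) pairs copied from the example rows in the table.
rows = {"A1": (0.90, 0.52), "A2": (0.58, 0.48)}

for metric, (value, baseline) in rows.items():
    print(metric, delta(value, baseline), relative_gain_percent(value, baseline))
```

Because both rows are marked as examples, these figures describe single model and framework pairings rather than an average across all six models.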

What To Try In 7 Days

Run a small experiment: add AwN-style pre-call checks to your agent and compare A1/A2/A3 on a sample of ambiguous prompts.

Use ToolEvaluator or a simple semantic-similarity judge to automate evaluation and spot failure modes quickly.

Add a lightweight human audit on a 50-case sample to validate automated judgments before rollout.
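The comparison in the first two steps can be prototyped with the standard library alone. The sketch below is an assumption-heavy stand-in: `difflib.SequenceMatcher` measures lexical overlap, not semantic similarity, the 0.6 threshold is arbitrary, the transcripts are made up, and `a1_rate` only approximates the paper's A1 metric.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for a real semantic judge."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def asked_expected_question(asked, expected, threshold=0.6):
    """True if the agent asked something close enough to the expected question."""
    if not asked:
        return False  # the agent never asked, so it cannot match
    return similarity(asked, expected) >= threshold


def a1_rate(run):
    """Share of cases where the expected clarification was asked."""
    hits = [asked_expected_question(c["asked"], c["expected"]) for c in run]
    return sum(hits) / len(hits)


# Made-up transcripts for illustration; replace with your own logged runs.
expected = "What date do you want to fly?"
baseline_run = [{"asked": None, "expected": expected}]
awn_run = [{"asked": "Which date do you want to fly?", "expected": expected}]

print("baseline A1:", a1_rate(baseline_run))
print("AwN A1:", a1_rate(awn_run))
```

Swapping `similarity` for an embedding-based or LLM-based judge keeps the rest of the harness unchanged, which makes it easier to check the cheap judge against the human-audit sample.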

Agent Features

Planning

Ask-when-Needed decision: check before call

Chain-of-Thought (CoT) integration

Tool Use

function calling, multi-step API sequences, clarifying-question loop

Frameworks

AwN, CoT, ReAct, DFSDT

Is Agentic

Yes

Architectures

LLM + API function calling

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

AwN improves but does not fully close the performance gap; many cases still fail.

ToolEvaluator is not perfect; automated judgments produce some false positives and negatives.

When Not To Use

When all user instructions are guaranteed complete and precise

When strict low-latency is required and extra clarification rounds are unacceptable

Failure Modes

Agent asks many redundant or irrelevant questions, increasing latency and cost

Automated judge mislabels correct/incorrect behavior (ToolEvaluator errors)
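The first failure mode can be bounded with a question budget. The sketch below is one possible guard, not part of AwN as described: `collect_arguments`, `scripted_user`, and the `max_rounds` default are hypothetical, and the user is simulated with a lookup table.

```python
def collect_arguments(required, args, ask_user, max_rounds=3):
    """Ask for each missing argument at most once, within a question budget."""
    filled = dict(args)
    rounds = 0
    names = list(dict.fromkeys(required))  # drop duplicates, keep order
    for name in names:
        if filled.get(name) not in (None, ""):
            continue  # already supplied, so asking again would be redundant
        if rounds == max_rounds:
            break  # budget spent: stop asking instead of adding latency
        answer = ask_user(f"Please provide '{name}'.")
        rounds += 1
        if answer:
            filled[name] = answer
    still_missing = [n for n in names if filled.get(n) in (None, "")]
    return filled, still_missing, rounds


# Simulated user for illustration; a real agent would prompt the person.
scripted_answers = {"Please provide 'date'.": "2025-03-01"}


def scripted_user(question):
    return scripted_answers.get(question)


print(collect_arguments(["origin", "date"], {"origin": "HKG"}, scripted_user))
```

When `still_missing` is non-empty after the budget is spent, the caller can decline the API call rather than fall back to guessed values.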

Core Entities

Models

gpt-3.5-turbo-0125, gpt-4-turbo-2024-0409, gpt-4o-2024-11-20, deepseek-v3, gemini-1.5flash-latest, claude-3-5-haiku-20241022

Metrics

A1, A2, A3, Re, Steps

Datasets

NoisyToolBench, ToolBench

Benchmarks

NoisyToolBench, ToolBench
