Ask-when-Needed (AwN): make LLM agents ask clarifying questions before calling APIs

August 31, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

AwN is easy to apply as a prompting change and shows consistent accuracy gains across models, but automated judging and per-model cost trade-offs need extra validation before production.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Why It Matters For Business

Agents that blindly fill missing arguments can call wrong APIs and produce unsafe or useless results; prompting them to ask reduces errors and improves end-user outcomes.

Summary TLDR

Real user prompts are often ambiguous and cause LLM agents to hallucinate arguments and call wrong APIs. The authors analyze real errors, build NoisyToolBench (ambiguous-instruction benchmark), propose Ask-when-Needed (AwN) prompting to make agents ask clarifying questions, and build ToolEvaluator to auto-run and judge interactions. Across six LLMs and two frameworks, AwN raises the rate of asking the right clarification and improves correct API calls and final answers, while adding a modest number of extra steps in most cases.

Problem Statement

LLM agents that call external APIs fail when user instructions omit, misstate, or ask for things beyond available tools. Because LLMs often predict missing arguments instead of asking, they hallucinate and mis-execute APIs. The paper studies real instruction errors, measures agent behavior under noisy instructions, and seeks a practical prompting fix.
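The ask-instead-of-guess decision described above can be illustrated with a minimal sketch. AwN itself is a prompting strategy, so this deterministic check is only an analogy for the behavior it aims to elicit. `ToolSpec`, `check_call`, and the `get_weather` tool are hypothetical names introduced here, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of one callable API (not from the paper)."""
    name: str
    required: list[str] = field(default_factory=list)


def check_call(tool: ToolSpec, args: dict) -> dict:
    """Return a call decision, or a clarifying question if arguments are missing."""
    # Treat absent, None, and empty-string values as missing.
    missing = [p for p in tool.required if args.get(p) in (None, "")]
    if missing:
        question = f"To call {tool.name}, I still need: {', '.join(missing)}."
        return {"action": "ask", "missing": missing, "question": question}
    return {"action": "call", "tool": tool.name, "args": args}


# Illustrative tool and an under-specified request (the date is never given).
weather = ToolSpec(name="get_weather", required=["city", "date"])
decision = check_call(weather, {"city": "Hong Kong"})
print(decision["action"], decision["missing"])
```

The point of the design is that a missing required argument produces a question rather than a guessed value. A real agent would also need to handle misstated values and requests beyond the available tools, which this sketch does not cover.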

Main Contribution

Systematic analysis of real-world problematic user instructions and a four-way taxonomy of common errors.

NoisyToolBench: a benchmark that injects realistic ambiguous, incorrect, and out-of-capability user queries for API-based agents.

Key Findings

Most user instruction errors omit required details.

Numbers: IMKI = 56.0%

Practical Use: Expect over half of problematic queries to be missing critical arguments; add clarification steps or validation before API calls.

Evidence Ref: Table 1, Section 3.1

Automated judging via ToolEvaluator closely matches humans on sampled cases.

Numbers: ToolEvaluator accuracy = 90%

Practical Use: You can automate large-scale agent evaluation but keep a small human audit due to remaining error risk.

Evidence Ref: Section 5.3
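A minimal sketch of how agreement between an automated judge and human labels could be measured on a sampled set. The function names and the ten labels below are invented for illustration; they are not ToolEvaluator's interface or the paper's data.

```python
def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the automated judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def disagreements(judge_labels: list[bool], human_labels: list[bool]) -> list[int]:
    """Indices of cases to hand back to a human reviewer."""
    return [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]


# Illustrative labels only: True means "interaction judged correct".
human = [True, True, False, True, False, True, True, False, True, True]
judge = [True, True, False, True, True, True, True, False, True, True]

agreement = judge_agreement(judge, human)
to_review = disagreements(judge, human)
print(agreement, to_review)
```

Listing the disagreeing cases, not just the rate, makes it easier to see whether the judge errs toward false positives or false negatives.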

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
A1 (ask expected clarifying questions) | up to 0.90 | examples: 0.52 (CoT, gpt-4o) | +0.38 (example) | NoisyToolBench | Table 2; 'Main Result' paragraph | Table 2
A2 (invoke correct API with correct args) | example improved to 0.58 | example baseline 0.48 | +0.10 (example) | NoisyToolBench | Table 2; 'Main Result' paragraph | Table 2

What To Try In 7 Days

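Run a small experiment: add AwN-style pre-call checks to your agent and compare A1, A2, and A3 (right clarification asked, correct API call, correct final answer) against your current agent on a sample of ambiguous prompts.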

Use ToolEvaluator or a simple semantic-similarity judge to automate evaluation and spot failure modes quickly.

Add a lightweight human audit on a 50-case sample to validate automated judgments before rollout.
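The comparison in the first step can be sketched as follows, assuming each logged episode has already been judged and stored as three booleans. The record layout, function names, and sample episodes are hypothetical, and A3 is taken here to mean final-answer correctness, as the summary suggests.

```python
from statistics import mean

METRICS = ("A1", "A2", "A3")


def rates(episodes: list[dict]) -> dict:
    """Per-metric success rate over a list of judged episodes."""
    return {m: mean(1.0 if ep[m] else 0.0 for ep in episodes) for m in METRICS}


def deltas(baseline: list[dict], treated: list[dict]) -> dict:
    """Treated-minus-baseline difference for each metric, rounded to 2 places."""
    b, t = rates(baseline), rates(treated)
    return {m: round(t[m] - b[m], 2) for m in METRICS}


# Invented episodes for illustration; real runs would use the same prompts
# for both conditions.
baseline_runs = [
    {"A1": True, "A2": True, "A3": False},
    {"A1": False, "A2": False, "A3": False},
    {"A1": False, "A2": True, "A3": True},
    {"A1": False, "A2": False, "A3": False},
]
awn_runs = [
    {"A1": True, "A2": True, "A3": True},
    {"A1": True, "A2": True, "A3": False},
    {"A1": True, "A2": False, "A3": False},
    {"A1": False, "A2": True, "A3": True},
]

result = deltas(baseline_runs, awn_runs)
print(result)
```

With only a handful of prompts the differences are noisy, so treat a small sample as a way to spot failure modes rather than as a measurement.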

Agent Features

Planning
Ask-when-Needed decision: check before call, Chain-of-Thought (CoT) integration
Tool Use
function calling, multi-step API sequences, clarifying-question loop
Frameworks
AwN, CoT, ReAct, DFSDT
Is Agentic

Yes

Architectures
LLM + API function calling

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

AwN improves results but does not fully close the performance gap; many cases still fail.

ToolEvaluator is not perfect; automated judgments produce some false positives and negatives.

When Not To Use

When all user instructions are guaranteed complete and precise

When strict low latency is required and extra clarification rounds are unacceptable

Failure Modes

Agent asks many redundant or irrelevant questions, increasing latency and cost

Automated judge mislabels correct/incorrect behavior (ToolEvaluator errors)
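One way to limit the first failure mode is to cap clarification rounds and never repeat a question. The sketch below is an assumption about how such a guard might look, not something the paper prescribes; `clarify_loop`, `max_rounds`, and the simulated user are hypothetical.

```python
def clarify_loop(required, args, answer_fn, max_rounds=2):
    """Ask for at most one missing argument per round, up to max_rounds.

    answer_fn(param) returns the user's reply, or None if they cannot answer.
    Returns (filled_args, asked_params, still_missing).
    """
    filled = dict(args)
    asked = []
    for _ in range(max_rounds):
        # Skip parameters already asked about, so no question is repeated.
        pending = [p for p in required if p not in filled and p not in asked]
        if not pending:
            break
        param = pending[0]
        asked.append(param)
        reply = answer_fn(param)
        if reply is not None:
            filled[param] = reply
    still_missing = [p for p in required if p not in filled]
    return filled, asked, still_missing


# Simulated user who knows the date but not the units.
user_knowledge = {"date": "2024-09-01"}
filled, asked, still_missing = clarify_loop(
    required=["city", "date", "units"],
    args={"city": "Paris"},
    answer_fn=user_knowledge.get,
    max_rounds=2,
)
print(filled, asked, still_missing)
```

When `still_missing` is non-empty after the budget is spent, the caller has to decide whether to abort or fall back to a default; that policy is left out here.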

Core Entities

Models

gpt-3.5-turbo-0125, gpt-4-turbo-2024-0409, gpt-4o-2024-11-20, deepseek-v3, gemini-1.5flash-latest, claude-3-5-haiku-20241022

Metrics

A1, A2, A3, Re, Steps

Datasets

NoisyToolBench, ToolBench

Benchmarks

NoisyToolBench, ToolBench