Overview
AwN is easy to apply as a prompting change and shows consistent accuracy gains across models, but automated judging and per-model cost trade-offs need extra validation before production.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 7
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Agents that blindly fill missing arguments can call wrong APIs and produce unsafe or useless results; prompting them to ask reduces errors and improves end-user outcomes.
Who Should Care
Summary TLDR
Real user prompts are often ambiguous, which leads LLM agents to hallucinate arguments and call the wrong APIs. The authors analyze real instruction errors, build NoisyToolBench (a benchmark of ambiguous instructions), propose Ask-when-Needed (AwN) prompting so that agents ask clarifying questions, and build ToolEvaluator to run and judge interactions automatically. Across six LLMs and two frameworks, AwN raises the rate at which agents ask the right clarifying question and improves both correct API calls and final answers, at the cost of a modest number of extra steps in most cases.
Problem Statement
LLM agents that call external APIs fail when user instructions omit, misstate, or ask for things beyond available tools. Because LLMs often predict missing arguments instead of asking, they hallucinate and mis-execute APIs. The paper studies real instruction errors, measures agent behavior under noisy instructions, and seeks a practical prompting fix.
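The failure described above (guessing a missing argument instead of asking for it) can be illustrated with a small guard that runs before any API call. This is a minimal sketch under stated assumptions, not the paper's AwN prompt: AwN is a prompting change that lets the model decide when to ask, whereas this example uses a deterministic check. `ToolSpec`, `missing_arguments`, `next_action`, and the `book_flight` tool are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of one API the agent may call."""
    name: str
    required: list[str] = field(default_factory=list)


def missing_arguments(tool: ToolSpec, provided: dict[str, object]) -> list[str]:
    """Return the required argument names that the user has not supplied."""
    return [arg for arg in tool.required if provided.get(arg) in (None, "")]


def next_action(tool: ToolSpec, provided: dict[str, object]) -> dict[str, object]:
    """Ask a clarifying question if anything is missing; otherwise call the API."""
    missing = missing_arguments(tool, provided)
    if missing:
        question = f"To use {tool.name}, could you tell me: {', '.join(missing)}?"
        return {"type": "ask", "question": question, "missing": missing}
    return {"type": "call", "tool": tool.name, "arguments": dict(provided)}


# Hypothetical tool and an under-specified user request.
book_flight = ToolSpec("book_flight", required=["origin", "destination", "date"])
action = next_action(book_flight, {"destination": "Paris"})
print(action["type"], action["missing"])
```

A rule like this only covers omitted arguments. Misstated values and requests beyond the available tools, which the paper also studies, would need a different check or the model's own judgment.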
Main Contribution
Systematic analysis of real-world problematic user instructions and a four-way taxonomy of common errors.
NoisyToolBench: a benchmark that injects realistic ambiguous, incorrect, and out-of-capability user queries for API-based agents.
Key Findings
Most user instruction errors omit required details.
Automated judging via ToolEvaluator closely matches humans on sampled cases.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| A1 (ask expected clarifying questions) | up to 0.90 | 0.52 (example: CoT, gpt-4o) | +0.38 (example) | NoisyToolBench | Table 2; 'Main Result' paragraph | Table 2 |
| A2 (invoke correct API with correct args) | 0.58 (example) | 0.48 (example) | +0.10 (example) | NoisyToolBench | Table 2; 'Main Result' paragraph | Table 2 |
What To Try In 7 Days
Run a small experiment: add AwN-style pre-call checks to your agent and compare A1/A2/A3 on a sample of ambiguous prompts.
Use ToolEvaluator or a simple semantic-similarity judge to automate evaluation and spot failure modes quickly.
Add a lightweight human audit of a 50-case sample to validate automated judgments before rollout.
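The "simple judge" step above could start as small as the sketch below. It is not ToolEvaluator, and it swaps in token-overlap (Jaccard) similarity as a crude stand-in for semantic similarity; an embedding-based comparison would be closer to the intent. The case format (`asked`, `expected`), the 0.5 threshold, and the reading of A1 as "every expected question was asked" are assumptions made for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def asked_expected(asked: list[str], expected: list[str], threshold: float = 0.5) -> bool:
    """True if every expected question is matched by at least one asked question.

    A case with no expected questions counts as a hit.
    """
    return all(any(jaccard(q, e) >= threshold for q in asked) for e in expected)


def a1_rate(cases: list[dict]) -> float:
    """Fraction of cases in which the agent asked all expected questions."""
    if not cases:
        return 0.0
    hits = sum(asked_expected(c["asked"], c["expected"]) for c in cases)
    return hits / len(cases)


# Two hypothetical cases: one where the agent asked, one where it stayed silent.
cases = [
    {"asked": ["What date do you want to fly?"],
     "expected": ["What date do you want to fly?"]},
    {"asked": [],
     "expected": ["Which city are you departing from?"]},
]
print(a1_rate(cases))
```

Lexical overlap will miss paraphrases ("When are you leaving?" against "What date do you want to fly?"), which is one reason to check its verdicts against the human-audited sample before trusting it.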
Agent Features
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Risks & Boundaries
Limitations
AwN improves but does not fully close the performance gap; many cases still fail.
ToolEvaluator is not perfect; automated judgments produce some false positives and negatives.
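One way to quantify how imperfect an automated judge is would be to compare its pass/fail verdicts with human labels on the same cases. The sketch below assumes paired boolean verdicts (True meaning "judged correct"); the function name and the sample labels are hypothetical and are not results from the paper.

```python
def judge_error_shares(auto: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare automated verdicts with human labels on the same cases.

    Each share is a fraction of all audited cases, not a conditional rate.
    """
    if len(auto) != len(human):
        raise ValueError("auto and human must have the same length")
    if not auto:
        raise ValueError("at least one audited case is required")
    n = len(auto)
    pairs = list(zip(auto, human))
    agree = sum(a == h for a, h in pairs)
    false_pos = sum(a and not h for a, h in pairs)  # judge passed, human failed
    false_neg = sum(h and not a for a, h in pairs)  # judge failed, human passed
    return {
        "agreement": agree / n,
        "false_positive_share": false_pos / n,
        "false_negative_share": false_neg / n,
    }


# Hypothetical audit of four cases.
auto_labels = [True, True, False, False]
human_labels = [True, False, True, False]
report = judge_error_shares(auto_labels, human_labels)
print(report)
```

Reporting false positives and false negatives separately matters here because they carry different risks: a false positive hides an agent failure, while a false negative understates a working behavior.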
When Not To Use
When all user instructions are guaranteed complete and precise
When strict low latency is required and extra clarification rounds are unacceptable
Failure Modes
Agent asks many redundant or irrelevant questions, increasing latency and cost
Automated judge mislabels correct/incorrect behavior (ToolEvaluator errors)

