Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Agents that blindly fill in missing arguments can call the wrong APIs and produce unsafe or useless results; prompting them to ask clarifying questions first reduces errors and improves end-user outcomes.
Summary TLDR
Real user prompts are often ambiguous and cause LLM agents to hallucinate arguments and call wrong APIs. The authors analyze real errors, build NoisyToolBench (ambiguous-instruction benchmark), propose Ask-when-Needed (AwN) prompting to make agents ask clarifying questions, and build ToolEvaluator to auto-run and judge interactions. Across six LLMs and two frameworks, AwN raises the rate of asking the right clarification and improves correct API calls and final answers, while adding a modest number of extra steps in most cases.
Problem Statement
LLM agents that call external APIs fail when user instructions omit required details, misstate them, or request things beyond the available tools. Because LLMs tend to predict missing arguments instead of asking for them, they hallucinate values and mis-execute APIs. The paper studies real instruction errors, measures agent behavior under noisy instructions, and seeks a practical prompting fix.
Main Contribution
Systematic analysis of real-world problematic user instructions and a four-way taxonomy of common errors.
NoisyToolBench: a benchmark that injects realistic ambiguous, incorrect, and out-of-capability user queries for API-based agents.
Ask-when-Needed (AwN): a prompting framework that forces agents to check API requirements and ask clarifying questions before calling functions.
ToolEvaluator: an automated pipeline that proxies users and uses semantic similarity plus GPT-4o judging to evaluate accuracy and efficiency.
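The check-before-call idea behind AwN can be illustrated with a small programmatic analogue. This is only a sketch: the paper describes AwN as a prompting framework, so the `ApiSpec` class, the `decide_next_action` helper, and the `book_flight` API below are hypothetical names introduced for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ApiSpec:
    """Hypothetical description of one callable API (not from the paper)."""
    name: str
    required: list[str]


def missing_arguments(spec: ApiSpec, provided: dict) -> list[str]:
    """Return the required argument names that the user has not supplied."""
    return [arg for arg in spec.required if provided.get(arg) in (None, "")]


def decide_next_action(spec: ApiSpec, provided: dict) -> dict:
    """Ask for missing arguments instead of guessing; call only when complete."""
    missing = missing_arguments(spec, provided)
    if missing:
        question = (
            f"To call {spec.name}, I still need: {', '.join(missing)}. "
            "Could you provide them?"
        )
        return {"action": "ask", "question": question, "missing": missing}
    return {"action": "call", "api": spec.name, "arguments": dict(provided)}


# Usage: an instruction that names only the origin triggers a question.
spec = ApiSpec(name="book_flight", required=["origin", "destination", "date"])
decision = decide_next_action(spec, {"origin": "SFO"})
print(decision["action"], decision["missing"])
```

A hard-coded check like this only covers omitted arguments; misstated values and out-of-capability requests would still need the model's own judgment.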
Key Findings
Most problematic user instructions omit required details.
Automated judging via ToolEvaluator closely matches humans on sampled cases.
AwN substantially raises the rate at which agents ask the expected clarifying question (A1).
AwN improves correct API calls (A2) and final answers (A3) in experiments.
Results
A1 (ask expected clarifying questions)
A2 (invoke correct API with correct args)
A3 (final answer aligns with intent)
ToolEvaluator agreement with human judges
Instruction error distribution (four categories)
Who Should Care
What To Try In 7 Days
Run a small experiment: add AwN-style pre-call checks to your agent and compare A1/A2/A3 on a sample of ambiguous prompts.
Use ToolEvaluator or a simple semantic-similarity judge to automate evaluation and spot failure modes quickly.
Add a lightweight human audit of a 50-case sample to validate automated judgments before rollout.
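A minimal sketch of how such a comparison might be scored is shown below. It assumes each logged case is a dict with hypothetical keys `asked`, `expected`, `api_call_correct`, and `answer_correct`, and it uses the standard library's `difflib.SequenceMatcher` as a lexical stand-in for a real semantic-similarity model; neither the record format nor the 0.6 threshold comes from the paper.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    """Lexical similarity check; a stand-in for a semantic-similarity judge."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


def rates(cases: list[dict]) -> dict:
    """Compute A1/A2/A3 as fractions over a list of logged cases."""
    n = len(cases)
    if n == 0:
        return {"A1": 0.0, "A2": 0.0, "A3": 0.0}
    # A1: the agent asked a question that matches the expected clarification.
    a1 = sum(1 for c in cases if c.get("asked") and similar(c["asked"], c["expected"]))
    # A2: the agent invoked the correct API with correct arguments.
    a2 = sum(1 for c in cases if c.get("api_call_correct"))
    # A3: the final answer aligned with the user's intent.
    a3 = sum(1 for c in cases if c.get("answer_correct"))
    return {"A1": a1 / n, "A2": a2 / n, "A3": a3 / n}


def compare(baseline: list[dict], with_awn: list[dict]) -> dict:
    """Per-metric difference (AwN run minus baseline run)."""
    base, awn = rates(baseline), rates(with_awn)
    return {metric: awn[metric] - base[metric] for metric in base}
```

Calling `compare(baseline_cases, awn_cases)` on the same set of ambiguous prompts gives a quick per-metric delta. The boolean fields still need to be filled in by a judge, human or automated, which is where the 50-case audit helps.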
Agent Features
Planning
- Ask-when-Needed decision: check before call
- Chain-of-Thought (CoT) integration
Tool Use
- function calling
- multi-step API sequences
- clarifying-question loop
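The clarifying-question loop could be sketched roughly as follows. The function name `clarification_loop`, the `ask_user` callback (which stands in for either a real user or a user proxy), and the `max_rounds` cap are all assumptions made for this illustration; the cap is there so an agent cannot keep asking indefinitely.

```python
from typing import Callable


def clarification_loop(
    required: list[str],
    provided: dict,
    ask_user: Callable[[str], str],
    max_rounds: int = 3,
) -> dict:
    """Ask for each missing required argument, up to max_rounds questions."""
    args = dict(provided)
    rounds = 0
    for name in required:
        if args.get(name):
            continue  # already supplied, no question needed
        if rounds >= max_rounds:
            # Stop asking rather than adding more latency and cost.
            return {"status": "gave_up", "arguments": args, "rounds": rounds}
        answer = ask_user(f"What is the value for '{name}'?").strip()
        rounds += 1
        if answer:
            args[name] = answer
    missing = [name for name in required if not args.get(name)]
    status = "ready" if not missing else "incomplete"
    return {"status": status, "arguments": args, "rounds": rounds}


# Usage with a scripted user proxy that answers from a fixed dictionary.
answers = {"destination": "Paris", "date": "2025-01-01"}


def scripted_user(question: str) -> str:
    return next((v for k, v in answers.items() if k in question), "")


result = clarification_loop(
    ["origin", "destination", "date"], {"origin": "Berlin"}, scripted_user
)
print(result["status"], result["rounds"])
```

Only a `"ready"` result should proceed to the API call; `"gave_up"` and `"incomplete"` signal that the agent should report what is missing instead of guessing.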
Frameworks
- AwN
- CoT
- ReAct
- DFSDT
Is Agentic
true
Architectures
- LLM + API function calling
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- AwN improves but does not fully close the performance gap; many cases still fail.
- ToolEvaluator is not perfect; automated judgments produce some false positives and negatives.
When Not To Use
- When all user instructions are guaranteed complete and precise
- When strict low latency is required and extra clarification rounds are unacceptable
Failure Modes
- Agent asks many redundant or irrelevant questions, increasing latency and cost
- Automated judge mislabels correct/incorrect behavior (ToolEvaluator errors)
- Agent still hallucinates or chooses wrong API despite asking
Core Entities
Models
- gpt-3.5-turbo-0125
- gpt-4-turbo-2024-04-09
- gpt-4o-2024-11-20
- deepseek-v3
- gemini-1.5-flash-latest
- claude-3-5-haiku-20241022
Metrics
- A1
- A2
- A3
- Re
- Steps
Datasets
- NoisyToolBench
- ToolBench
Benchmarks
- NoisyToolBench
- ToolBench

