Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
TAFC improves parameter accuracy and makes API calls explainable without modifying the underlying LLM, which reduces silent failures and eases debugging for tool-driven agents in production.
Summary TLDR
TAFC augments API function signatures with an explicit 'think' field in which the LLM explains how it chose each parameter. It generates per-parameter reasoning only when needed, tunes tool description prompts automatically, and works with existing LLMs and APIs without model changes. On ToolBench (16k+ APIs), TAFC consistently raises Pass Rate and Win Rate (≈+1.6–2.5% pass rate, with larger relative gains for small models) and yields much higher judged parameter quality (TAFC wins ~69.6% of judged comparisons vs 18.2%).
Problem Statement
Current function calling lacks per-parameter transparency. Models must choose multiple interdependent parameters without stating any justification, which makes debugging hard and increases errors on complex multi-parameter API calls.
Main Contribution
Introduce TAFC: add a structured 'think' parameter to function signatures so models output reasoning alongside parameter values.
Trigger per-parameter reasoning selectively via a complexity score that measures dependency, type complexity, and constraints.
Provide dynamic optimization (discrete prompt tuning and continuous prompt embeddings) to improve reasoning elicitation and align reasoning with human expectations.
Deploy TAFC without changing LLM architectures and keep full API compatibility; evaluate on ToolBench across proprietary and open-source models.
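The signature augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes tools are described with an OpenAI-style JSON Schema dictionary, and the helper name add_think_field, the description text, and the get_weather example are all hypothetical. Placing 'think' first in the property order is also an assumption, made so that a model emitting arguments in schema order would write its reasoning before the parameter values.

```python
import copy

# Hypothetical description text; the paper tunes such descriptions automatically.
THINK_DESCRIPTION = (
    "Explain briefly how each parameter value was chosen before filling it in."
)


def add_think_field(schema: dict, description: str = THINK_DESCRIPTION) -> dict:
    """Return a copy of a tool schema with a string 'think' parameter added."""
    augmented = copy.deepcopy(schema)  # never mutate the caller's schema
    params = augmented.setdefault("parameters", {"type": "object"})
    props = params.get("properties", {})
    if "think" in props:
        raise ValueError("schema already defines a 'think' parameter")
    # Put 'think' first; the original properties follow unchanged.
    params["properties"] = {
        "think": {"type": "string", "description": description},
        **props,
    }
    return augmented


# Hypothetical tool definition used only for illustration.
weather_schema = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

augmented_schema = add_think_field(weather_schema)
```

The 'required' list is left untouched here, so the field stays optional; whether to require it is a design choice the summary does not specify.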
Key Findings
TAFC improves Pass Rate across model sizes
TAFC strongly improves judged parameter quality
TAFC reduces omission errors
Results
Average Pass Rate (small models)
Average Win Rate improvement (small models)
Parameter quality judged wins
Who Should Care
What To Try In 7 Days
Add a 'think' string field to critical multi-parameter API signatures and log its content.
Set a complexity threshold (start with τ = 0.6) so that per-parameter reasoning is generated only when needed.
Run a small ToolBench-style test set to compare Pass/Win Rates before and after TAFC inside your agent stack.
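The thresholding step in the list above might look like the sketch below. The summary says only that the complexity score measures dependency, type complexity, and constraints; the weighted-sum form, the weights, the 0-to-1 feature scale, and the names ParamFeatures, complexity_score, and needs_reasoning are assumptions made for illustration. Only the starting threshold τ = 0.6 comes from the text.

```python
from dataclasses import dataclass


@dataclass
class ParamFeatures:
    """Hypothetical per-parameter features, each normalised to [0, 1]."""
    dependency: float       # how strongly the value depends on other parameters
    type_complexity: float  # e.g. nested object or enum versus plain string
    constraints: float      # density of format/range constraints


def complexity_score(f: ParamFeatures, weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted sum of the three features (assumed form, not from the paper)."""
    w_dep, w_type, w_con = weights
    return w_dep * f.dependency + w_type * f.type_complexity + w_con * f.constraints


def needs_reasoning(f: ParamFeatures, tau: float = 0.6) -> bool:
    """Request per-parameter reasoning only at or above the threshold tau."""
    return complexity_score(f) >= tau


simple_param = ParamFeatures(dependency=0.1, type_complexity=0.0, constraints=0.2)
complex_param = ParamFeatures(dependency=0.9, type_complexity=0.8, constraints=0.7)

print(needs_reasoning(simple_param))   # below tau: skip reasoning
print(needs_reasoning(complex_param))  # above tau: ask for reasoning
```

Treat τ as a tunable knob: raising it saves tokens on simple calls, lowering it adds reasoning to more parameters.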
Agent Features
Planning
- Function Calling
Tool Use
- Function Calling
- Tool Selection
Frameworks
- ReAct
Is Agentic
true
Optimization Features
Token Efficiency
- Selective per-parameter reasoning to limit tokens
Infra Optimization
- No model changes, so base models need no retraining
System Optimization
- Backwards-compatible signature augmentation and filtering at invocation
Training Optimization
- Reasoning-guided tool description optimization (semantic, logic, action losses)
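The "filtering at invocation" item under System Optimization can be illustrated with a small dispatcher that removes the 'think' argument before the real API is called, so the underlying function never sees a parameter it does not accept. This is a sketch under assumptions: invoke_tool and get_weather are hypothetical names, and logging the reasoning through the standard logging module is one possible choice rather than something the summary prescribes.

```python
import logging

logger = logging.getLogger("tafc")


def invoke_tool(tool, arguments: dict):
    """Strip 'think' from model-produced arguments, log it, then call the tool."""
    args = dict(arguments)               # copy so the caller's dict is unchanged
    reasoning = args.pop("think", None)  # absent when reasoning was not requested
    if reasoning is not None:
        logger.info("tool=%s think=%s", tool.__name__, reasoning)
    return tool(**args), reasoning


def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical stand-in for a real API; it has no 'think' parameter."""
    return f"{city}: 21 {unit}"


model_arguments = {
    "think": "User asked about Oslo and uses metric units, so unit=celsius.",
    "city": "Oslo",
    "unit": "celsius",
}
result, reasoning = invoke_tool(get_weather, model_arguments)
```

Because the tool itself is unchanged, calls that arrive without a 'think' argument pass through the same dispatcher untouched.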
Reproducibility
Data Urls
- ToolBench (ToolLLM) benchmark - ToolEval protocol; 16,000+ REST APIs
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Over-reasoning can hurt simple single-parameter functions; the paper notes that standard function calling (Standard FC) outperforms TAFC in such cases
- Adds token and latency overhead when reasoning is generated
- Evaluation limited to ToolBench and judge-based assessments; real-world diversity may differ
- Relies on prompt/description tuning and an LLM judge, which may introduce bias
When Not To Use
- Simple single-parameter APIs where extra reasoning adds noise
- Ultra-low-latency or strict token-budget environments
- When you cannot capture or log the reasoning because of privacy or compliance constraints
Failure Modes
- Reasoning hallucination leads to incorrect parameter values
- Over-reasoning increases error on trivial calls
- LLM-as-judge bias favors verbose or plausible-sounding reasoning over correctness
Core Entities
Models
- GPT-4o-0806
- Claude-3.5-Sonnet
- Qwen2.5-72B
- Qwen2.5-32B
- Qwen2.5-7B
- Llama-3.1-70B
- Llama-3.1-8B
Metrics
- Pass Rate
- Win Rate
- Parameter quality win rate
- Reasoning coherence
- Omission error rate
Datasets
- ToolBench (ToolLLM) - 16,000+ REST APIs
Benchmarks
- ToolBench
- ToolEval protocol

