Overview
TAFC is immediately deployable with existing LLMs and APIs; evidence from ToolBench shows consistent, moderate improvements and stronger gains on smaller models.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
TAFC improves parameter accuracy and adds explainability for API calls without changing LLMs, reducing silent failures and easing debugging for tool-driven agents in production.
Summary TLDR
TAFC augments API function signatures with an explicit 'think' field so LLMs explain how they chose each parameter. It generates per-parameter reasoning only when needed, tunes description prompts automatically, and works with existing LLMs and APIs without model changes. On ToolBench (16k+ APIs), TAFC consistently raises Pass and Win Rates (≈+1.6–2.5% Pass Rate, with larger relative gains for small models) and yields much higher judged parameter quality (TAFC wins ~69.6% of judged comparisons vs 18.2%).
Problem Statement
Current function-calling lacks per-parameter transparency. Models must pick multiple interdependent parameters without explicit internal justification, which makes debugging hard and increases errors on complex multi-parameter API calls.
Main Contribution
Introduce TAFC: add a structured 'think' parameter to function signatures so models output reasoning alongside parameter values.
Trigger per-parameter reasoning selectively via a complexity score that measures dependency, type complexity, and constraints.
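As a minimal sketch of the first contribution, the snippet below adds a 'think' string field to a JSON-Schema-style function signature and strips it again before the real API is called. The schema layout, the helper names `add_think_field` and `split_think`, the `search_flights` API, and the choice to place the field first are all illustrative assumptions, not details confirmed by the paper.

```python
import copy

def add_think_field(schema: dict, field: str = "think") -> dict:
    """Return a copy of a function signature with a required reasoning field."""
    out = copy.deepcopy(schema)  # leave the caller's schema untouched
    params = out.setdefault("parameters", {"type": "object", "properties": {}})
    props = params.get("properties", {})
    # Assumption: listing the reasoning field first nudges the model to
    # write its justification before the parameter values.
    params["properties"] = {
        field: {
            "type": "string",
            "description": "Explain how each parameter value was chosen.",
        },
        **props,
    }
    required = [r for r in params.get("required", []) if r != field]
    params["required"] = [field] + required
    return out

def split_think(arguments: dict, field: str = "think") -> tuple[str, dict]:
    """Separate the reasoning text from the arguments sent to the real API."""
    args = dict(arguments)
    reasoning = args.pop(field, "")
    return reasoning, args

# Hypothetical API signature used only for illustration.
search_flights = {
    "name": "search_flights",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["origin", "destination", "date"],
    },
}

augmented = add_think_field(search_flights)

# Made-up model output for the augmented signature.
model_args = {
    "think": "User asked for a one-way trip; date taken from the request.",
    "origin": "SFO",
    "destination": "JFK",
    "date": "2025-01-10",
}
reasoning, api_args = split_think(model_args)
```

Keeping the reasoning out of `api_args` means the underlying API sees exactly the parameters it expects, while `reasoning` can be logged for debugging.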
Key Findings
TAFC improves Pass Rate across model sizes
TAFC strongly improves judged parameter quality
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average Pass Rate (small models) | ≈+2.4–2.5% absolute | Standard Function Calling | +2.4–2.5% | ToolBench (Llama-3.1-8B, Qwen2.5-7B) | Table 1 shows Llama-3.1-8B avg 27.3% → 29.7%; Qwen2.5-7B 28.7% → 31.2% | Table 1 |
| Average Win Rate improvement (small models) | ≈+2.9–3.1% absolute | Standard Function Calling | +2.9–3.1% | ToolBench (I1/I2/I3) | Section 3.2 reports Win Rate gains of 2.9–3.1% for small models | Section 3.2; Table 1 |
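The Pass Rate row can be checked with a little arithmetic. The sketch below recomputes the absolute gain (in percentage points) and the relative gain from the Table 1 averages quoted above; the helper name `gains` is an illustrative assumption.

```python
# Average Pass Rates (%) as quoted from Table 1: (Standard FC, TAFC).
pass_rates = {
    "Llama-3.1-8B": (27.3, 29.7),
    "Qwen2.5-7B": (28.7, 31.2),
}

def gains(baseline: float, tafc: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in % of baseline)."""
    absolute = round(tafc - baseline, 1)
    relative = round(100 * (tafc - baseline) / baseline, 1)
    return absolute, relative

for model, (base, tafc) in pass_rates.items():
    abs_pp, rel_pct = gains(base, tafc)
    print(f"{model}: +{abs_pp} points absolute, +{rel_pct}% relative")
```

With these inputs the absolute gains are 2.4 and 2.5 points, which correspond to relative gains of roughly 8.8% and 8.7% over the baseline.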
What To Try In 7 Days
Add a 'think' string field to critical multi-parameter API signatures and log its content.
Set a complexity threshold (start with τ = 0.6) so that per-parameter reasoning is generated only when needed.
Run a small ToolBench-style test set to compare Pass/Win Rates before and after TAFC inside your agent stack.
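The threshold step could be prototyped along these lines. The summary only says the complexity score reflects dependency, type complexity, and constraints, so the weighted-sum formula, the weights, the normalisation caps, and the `ParamInfo` fields below are assumptions made for illustration rather than the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class ParamInfo:
    name: str
    depends_on: int   # number of other parameters this one depends on
    type_depth: int   # 0 for scalars, 1+ for nested objects or arrays
    constraints: int  # count of enum / range / format constraints

# Hypothetical weights and caps; tune them on your own APIs.
WEIGHTS = {"dependency": 0.4, "type": 0.3, "constraint": 0.3}
CAPS = {"dependency": 3, "type": 2, "constraint": 3}

def complexity(p: ParamInfo) -> float:
    """Weighted sum of three factors, each normalised to [0, 1]."""
    dep = min(p.depends_on, CAPS["dependency"]) / CAPS["dependency"]
    typ = min(p.type_depth, CAPS["type"]) / CAPS["type"]
    con = min(p.constraints, CAPS["constraint"]) / CAPS["constraint"]
    return (WEIGHTS["dependency"] * dep
            + WEIGHTS["type"] * typ
            + WEIGHTS["constraint"] * con)

def params_needing_reasoning(params: list[ParamInfo], tau: float = 0.6) -> list[str]:
    """Names of parameters whose score reaches the threshold tau."""
    return [p.name for p in params if complexity(p) >= tau]

# Made-up parameters for a hypothetical hotel-search API.
params = [
    ParamInfo("city", depends_on=0, type_depth=0, constraints=0),
    ParamInfo("date_range", depends_on=2, type_depth=1, constraints=2),
    ParamInfo("filters", depends_on=3, type_depth=2, constraints=3),
]

selected = params_needing_reasoning(params, tau=0.6)
```

Here `city` scores 0 and is skipped, while `date_range` and `filters` clear τ = 0.6 and would receive per-parameter reasoning. Raising τ trades explainability for fewer tokens.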
Agent Features
Planning
Tool Use
Frameworks
Is Agentic
Yes
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Training Optimization
Risks & Boundaries
Limitations
Over-reasoning can hurt simple single-parameter functions; the paper notes that standard function calling outperforms TAFC in such cases
Adds token and latency overhead when reasoning is generated
When Not To Use
Simple single-parameter APIs where extra reasoning adds noise
Ultra-low-latency or strict token-budget environments
Failure Modes
Reasoning hallucination leads to incorrect parameter values
Over-reasoning increases error on trivial calls

