Add an explicit 'think' reasoning field to function calls to improve parameter accuracy and explain decisions

January 26, 2026 · 6 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Lei Wei, Xiao Peng, Jinpeng Ou, Bin Wang

Links

Abstract / PDF

Why It Matters For Business

Think-Augmented Function Calling (TAFC) improves parameter accuracy and makes API calls explainable without changing the underlying LLM. That reduces silent failures and eases debugging for tool-driven agents in production.

Summary TLDR

TAFC augments API function signatures with an explicit 'think' field so LLMs explain how they chose each parameter. It adds per-parameter reasoning only when needed, tunes tool description prompts automatically, and works with existing LLMs and APIs without model changes. On ToolBench (16,000+ REST APIs), TAFC raises Pass Rate by roughly 1.6 to 2.5 percentage points depending on the model, with larger relative gains for small models, and it wins far more judged parameter-quality comparisons (69.6% on average versus 18.2% for standard function calling).

Problem Statement

Current function-calling lacks per-parameter transparency. Models must pick multiple interdependent parameters without explicit internal justification, which makes debugging hard and increases errors on complex multi-parameter API calls.

Main Contribution

Introduce TAFC: add a structured 'think' parameter to function signatures so models output reasoning alongside parameter values.

Trigger per-parameter reasoning selectively via a complexity score that measures dependency, type complexity, and constraints.

Provide dynamic optimization (discrete prompt tuning and continuous prompt embeddings) to improve reasoning elicitation and align reasoning with human expectations.

Deploy TAFC without changing LLM architectures and keep full API compatibility; evaluate on ToolBench across proprietary and open-source models.
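The signature augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes tools are described with a JSON-Schema-style dictionary (as many function-calling APIs do), and the helper name `add_think_field`, the field description text, and the `get_forecast` tool are all hypothetical. Placing `think` first so the model writes its reasoning before the parameter values is also an assumption.

```python
import copy

# Hypothetical wording; the summary does not give the paper's actual description text.
THINK_DESCRIPTION = (
    "Explain step by step how you chose each parameter value "
    "before filling in the remaining fields."
)

def add_think_field(tool_schema: dict) -> dict:
    """Return a copy of a tool definition with a 'think' string parameter added."""
    augmented = copy.deepcopy(tool_schema)  # leave the caller's schema untouched
    params = augmented.setdefault("parameters", {"type": "object"})
    props = params.get("properties", {})
    # Rebuild the mapping so 'think' comes first (assumed ordering).
    params["properties"] = {
        "think": {"type": "string", "description": THINK_DESCRIPTION},
        **props,
    }
    required = params.setdefault("required", [])
    if "think" not in required:
        required.insert(0, "think")
    return augmented

# Hypothetical multi-parameter tool used only for illustration.
weather_tool = {
    "name": "get_forecast",
    "description": "Get a weather forecast.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

augmented_tool = add_think_field(weather_tool)
```

Because the original schema is copied rather than mutated, the unmodified definition stays available for callers that do not want the extra field.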

Key Findings

TAFC improves Pass Rate across model sizes

Numbers: Pass Rate +1.6% to +2.5% (varies by model/size)

TAFC strongly improves judged parameter quality

Numbers: Average TAFC win 69.6% vs Standard 18.2%

TAFC reduces omission errors

Numbers: Omissions reduced by 38%

Results

Average Pass Rate (small models)

Value: ≈ +2.4 to +2.5 percentage points (absolute)

Baseline: Standard Function Calling

Average Win Rate improvement (small models)

Value: ≈ +2.9 to +3.1 percentage points (absolute)

Baseline: Standard Function Calling

Parameter quality judged wins

Value: TAFC wins 69.6% (average)

Baseline: Standard FC wins 18.2%
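The judged-win figures above are shares of pairwise comparisons, and they sum to 87.8%. This summary does not say how the remaining 12.2% was judged. The sketch below shows how such shares could be tallied from judge verdicts, assuming a third "tie" label; the function name, the labels, and the verdict list are made up for illustration and are not the paper's data.

```python
from collections import Counter

LABELS = ("tafc", "standard", "tie")  # 'tie' is an assumed third outcome

def win_shares(verdicts: list[str]) -> dict[str, float]:
    """Turn a list of per-example judge verdicts into a share per label."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    unknown = set(verdicts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in LABELS}

# Illustrative verdicts only, not results from the paper.
sample_verdicts = ["tafc"] * 7 + ["standard"] * 2 + ["tie"]
shares = win_shares(sample_verdicts)
```

Reporting all three shares together makes it clear when two headline percentages are not meant to sum to 100.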

Who Should Care

What To Try In 7 Days

Add a 'think' string field to critical multi-parameter API signatures and log its content.

Set a complexity threshold (start τ=0.6) to only generate per-parameter reasoning when needed.

Run a small ToolBench-style test set to compare Pass/Win Rates before and after TAFC inside your agent stack.
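The complexity threshold in the second step can be prototyped with a simple gate. The summary says the score measures dependency, type complexity, and constraints but does not give the formula, so the equal-weight average below is an assumption, as are the `ParamProfile` type, the 0-to-1 feature scales, and the example values. Only the starting threshold τ = 0.6 comes from the text above.

```python
from dataclasses import dataclass

TAU = 0.6  # starting threshold suggested above

@dataclass
class ParamProfile:
    """Hand-assigned features for one parameter, each scaled to [0, 1]."""
    name: str
    dependency: float       # how much the value depends on other parameters
    type_complexity: float  # e.g. nested objects score higher than plain strings
    constraints: float      # e.g. enums, ranges, formats

def complexity(p: ParamProfile, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average of the three features (assumed formula, not the paper's)."""
    w_dep, w_type, w_con = weights
    return w_dep * p.dependency + w_type * p.type_complexity + w_con * p.constraints

def params_needing_reasoning(profiles: list[ParamProfile], tau: float = TAU) -> list[str]:
    """Names of parameters whose score reaches the threshold."""
    return [p.name for p in profiles if complexity(p) >= tau]

# Hypothetical profiles for a booking-style API.
profiles = [
    ParamProfile("city", dependency=0.1, type_complexity=0.1, constraints=0.2),
    ParamProfile("date_range", dependency=0.9, type_complexity=0.7, constraints=0.8),
    ParamProfile("units", dependency=0.5, type_complexity=0.2, constraints=0.3),
]

selected = params_needing_reasoning(profiles)
```

With these example values only `date_range` crosses the threshold, so only that parameter would be asked for reasoning. Raising or lowering `tau` trades token cost against coverage.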

Agent Features

Planning

  • Function Calling

Tool Use

  • Function Calling
  • Tool Selection

Frameworks

  • ReAct

Is Agentic

true

Optimization Features

Token Efficiency

  • Selective per-parameter reasoning to limit tokens

Infra Optimization

  • No model changes; avoids retraining base models

System Optimization

  • Backwards-compatible signature augmentation and filtering at invocation
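Filtering at invocation can be sketched as a thin wrapper that removes the reasoning field before the real API is called, so the underlying function never sees it. This is a minimal sketch under assumptions: the model's arguments arrive as a JSON string, the field is named `think`, and `invoke_tool` and `get_forecast` are hypothetical names.

```python
import json
import logging

logger = logging.getLogger("tafc")

def invoke_tool(tool_fn, raw_arguments: str):
    """Parse model-produced arguments, log and drop 'think', then call the tool."""
    args = json.loads(raw_arguments)
    reasoning = args.pop("think", None)  # absent for calls made without TAFC
    if reasoning is not None:
        logger.info("reasoning for %s: %s", tool_fn.__name__, reasoning)
    return tool_fn(**args)

# Hypothetical tool with an unchanged, think-free signature.
def get_forecast(city: str, days: int = 1) -> str:
    return f"{days}-day forecast for {city}"

result = invoke_tool(
    get_forecast,
    '{"think": "User asked about the weekend, so 3 days.", "city": "Oslo", "days": 3}',
)
```

Because the wrapper tolerates a missing `think` key, the same code path serves both augmented and standard calls. If reasoning text may contain sensitive data, the logging line is the place to redact or disable it.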

Training Optimization

  • Reasoning-guided tool description optimization (semantic, logic, action losses)

Reproducibility

Data Urls

  • ToolBench (ToolLLM) benchmark - ToolEval protocol; 16,000+ REST APIs

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Over-reasoning can hurt simple single-parameter functions; the paper notes that standard function calling outperforms TAFC in such cases
  • Adds token and latency overhead when reasoning is generated
  • Evaluation limited to ToolBench and judge-based assessments; real-world diversity may differ
  • Relies on prompt/description tuning and an LLM judge, which may introduce bias

When Not To Use

  • Simple single-parameter APIs where extra reasoning adds noise
  • Ultra-low-latency or strict token-budget environments
  • When you cannot capture or log the reasoning due to privacy or compliance

Failure Modes

  • Reasoning hallucination leads to incorrect parameter values
  • Over-reasoning increases error on trivial calls
  • LLM-as-judge bias favors verbose or plausible-sounding reasoning over correctness

Core Entities

Models

  • GPT-4o-0806
  • Claude-3.5-Sonnet
  • Qwen2.5-72B
  • Qwen2.5-32B
  • Qwen2.5-7B
  • Llama-3.1-70B
  • Llama-3.1-8B

Metrics

  • Pass Rate
  • Win Rate
  • Parameter quality win rate
  • Reasoning coherence
  • Omission error rate

Datasets

  • ToolBench (ToolLLM) - 16,000+ REST APIs

Benchmarks

  • ToolBench
  • ToolEval protocol