Add an explicit 'think' reasoning field to function calls to improve parameter accuracy and explain decisions

Overview

Decision Snapshot: Ready For Pilot

TAFC (Think-Augmented Function Calling) can be deployed with existing LLMs and APIs without model changes. On ToolBench it shows consistent, moderate improvements, with stronger gains on smaller models.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Lei Wei, Xiao Peng, Jinpeng Ou, Bin Wang

Links

Abstract / PDF / Data

Why It Matters For Business

TAFC improves parameter accuracy and adds explainability for API calls without changing LLMs, reducing silent failures and easing debugging for tool-driven agents in production.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

TAFC augments API function signatures with an explicit 'think' field so LLMs explain how they chose each parameter. It adds per-parameter reasoning only when needed, tunes tool description prompts automatically, and works with existing LLMs and APIs with no model changes. On ToolBench (16k+ APIs), TAFC raises Pass and Win Rates consistently (about +1.6 to +2.5 percentage points in Pass Rate, with larger relative gains for small models) and yields much higher judged parameter quality (average win rate of about 69.6% for TAFC vs 18.2% for standard function calling).

Problem Statement

Current function-calling lacks per-parameter transparency. Models must pick multiple interdependent parameters without explicit internal justification, which makes debugging hard and increases errors on complex multi-parameter API calls.

Main Contribution

Introduce TAFC: add a structured 'think' parameter to function signatures so models output reasoning alongside parameter values.

Trigger per-parameter reasoning selectively via a complexity score that measures dependency, type complexity, and constraints.
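As a rough illustration of the first contribution, the sketch below adds a 'think' string property to a JSON-Schema-style tool definition. The schema layout, the helper name add_think_field, the description wording, and the search_flights tool are all assumptions made for this example; the paper's exact field text is not reproduced here.

```python
import copy

# Hypothetical wording; not taken from the paper.
THINK_DESCRIPTION = (
    "Explain briefly how each parameter value was chosen."
)

def add_think_field(tool_schema: dict) -> dict:
    """Return a copy of a tool definition with an extra 'think' string property."""
    augmented = copy.deepcopy(tool_schema)  # leave the original schema untouched
    params = augmented.setdefault("parameters", {"type": "object", "properties": {}})
    props = params.setdefault("properties", {})
    # List 'think' first (an assumed design choice) so reasoning precedes the values.
    params["properties"] = {
        "think": {"type": "string", "description": THINK_DESCRIPTION},
        **props,
    }
    return augmented

# Hypothetical example tool, for illustration only.
search_flights = {
    "name": "search_flights",
    "description": "Search flights between two airports.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["origin", "destination", "date"],
    },
}

augmented = add_think_field(search_flights)
print(list(augmented["parameters"]["properties"]))
# → ['think', 'origin', 'destination', 'date']
```

The 'think' field is left out of "required" here, so calls that omit it remain valid against the augmented schema.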

Key Findings

TAFC improves Pass Rate across model sizes

Numbers: Pass Rate +1.6% to +2.5% (varies by model/size)

Practical Use: Expect a modest but consistent accuracy boost when calling multi-parameter APIs; benefits are larger on smaller models.

Evidence Ref: Table 1; Section 3.2

TAFC strongly improves judged parameter quality

Numbers: Average TAFC win 69.6% vs Standard 18.2%

Practical Use: When you need parameters that match human intent or constraints, TAFC yields much better parameter choices on evaluated benchmarks.

Evidence Ref: Table 2; Section 3.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average Pass Rate (small models)	≈+2.4–2.5% absolute	Standard Function Calling	+2.4–2.5%	ToolBench (Llama-3.1-8B, Qwen2.5-7B)	Table 1 shows Llama-3.1-8B avg 27.3% → 29.7%; Qwen2.5-7B 28.7% → 31.2%	Table 1
Average Win Rate improvement (small models)	≈+2.9–3.1% absolute	Standard Function Calling	+2.9–3.1%	ToolBench (I1/I2/I3)	Section 3.2 reports Win Rate gains of 2.9–3.1% for small models	Section 3.2; Table 1
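
The Pass Rate deltas in the table are absolute differences in percentage points. The short calculation below uses only the Table 1 figures quoted above to show both the absolute gain and the corresponding gain relative to the baseline; the helper name gains is just for this example.

```python
# (baseline, TAFC) average Pass Rate in percent, as quoted from Table 1 above.
pass_rates = {
    "Llama-3.1-8B": (27.3, 29.7),
    "Qwen2.5-7B": (28.7, 31.2),
}

def gains(baseline: float, tafc: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    absolute = round(tafc - baseline, 1)
    relative = round(100 * (tafc - baseline) / baseline, 1)
    return absolute, relative

for model, (base, tafc) in pass_rates.items():
    absolute, relative = gains(base, tafc)
    print(f"{model}: +{absolute} points absolute, +{relative}% relative")
# → Llama-3.1-8B: +2.4 points absolute, +8.8% relative
# → Qwen2.5-7B: +2.5 points absolute, +8.7% relative
```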

What To Try In 7 Days

Add a 'think' string field to critical multi-parameter API signatures and log its content.

Set a complexity threshold (start τ=0.6) to only generate per-parameter reasoning when needed.

Run a small ToolBench-style test set to compare Pass/Win Rates before and after TAFC inside your agent stack.
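
The threshold step above could be prototyped along the following lines. The summary only says the complexity score reflects dependency, type complexity, and constraints; the equal-weight average, the [0, 1] normalization, and the ParamProfile type are assumptions for this sketch, not the paper's formula.

```python
from dataclasses import dataclass

@dataclass
class ParamProfile:
    """Hypothetical per-parameter features, each normalized to [0, 1]."""
    dependency: float       # how much the value depends on other parameters
    type_complexity: float  # plain string = low, nested object = high
    constraints: float      # density of enums, ranges, formats, and similar rules

def complexity_score(p: ParamProfile, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average of the three components (equal weights are an assumption)."""
    w_dep, w_type, w_con = weights
    return w_dep * p.dependency + w_type * p.type_complexity + w_con * p.constraints

def needs_reasoning(p: ParamProfile, tau: float = 0.6) -> bool:
    """Request per-parameter reasoning only when the score reaches the threshold."""
    return complexity_score(p) >= tau

# Hypothetical parameters, for illustration only.
simple_id = ParamProfile(dependency=0.0, type_complexity=0.1, constraints=0.2)
date_range = ParamProfile(dependency=0.9, type_complexity=0.6, constraints=0.8)

print(needs_reasoning(simple_id))   # → False
print(needs_reasoning(date_range))  # → True
```

Starting at τ=0.6 and adjusting it against the before/after comparison gives a simple way to trade token overhead against accuracy.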

Agent Features

Planning

Function Calling

Tool Use

Function Calling, Tool Selection

Frameworks

ReAct

Is Agentic

Yes

Optimization Features

Token Efficiency

Selective per-parameter reasoning to limit tokens

Infra Optimization

No model change, avoids retraining base models

System Optimization

Backwards-compatible signature augmentation and filtering at invocation

Training Optimization

Reasoning-guided tool description optimization (semantic, logic, action losses)
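
The backwards-compatible filtering mentioned under System Optimization might look like the sketch below: the 'think' text is logged and removed before the real API is called, so the underlying function signature stays unchanged. The function names, logger name, and sample model output are hypothetical.

```python
import json
import logging

logger = logging.getLogger("tafc")

def split_think(arguments_json: str) -> tuple[str, dict]:
    """Separate the model's 'think' text from the arguments the real API expects."""
    args = json.loads(arguments_json)
    think = args.pop("think", "")  # a missing field is tolerated, so plain calls still work
    return think, args

def invoke(tool, arguments_json: str):
    """Log the reasoning, then call the tool with the remaining arguments only."""
    think, args = split_think(arguments_json)
    logger.info("tool=%s think=%s", getattr(tool, "__name__", "tool"), think)
    return tool(**args)  # the underlying function never sees 'think'

# Hypothetical tool and model output, for illustration only.
def get_weather(city: str, unit: str = "celsius") -> str:
    return f"{city}:{unit}"

model_output = (
    '{"think": "User is in Paris and asked for metric units.", '
    '"city": "Paris", "unit": "celsius"}'
)
print(invoke(get_weather, model_output))  # → Paris:celsius
```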

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

ToolBench (ToolLLM) benchmark - ToolEval protocol; 16,000+ REST APIs

Risks & Boundaries

Limitations

Over-reasoning can hurt simple single-parameter functions; the paper notes that standard function calling outperforms TAFC in such cases

Adds token and latency overhead when reasoning is generated

When Not To Use

Simple single-parameter APIs where extra reasoning adds noise

Ultra-low-latency or strict token-budget environments

Failure Modes

Reasoning hallucination leads to incorrect parameter values

Over-reasoning increases error on trivial calls

Core Entities

Models

GPT-4o-0806, Claude-3.5-Sonnet, Qwen2.5-72B, Qwen2.5-32B, Qwen2.5-7B, Llama-3.1-70B, Llama-3.1-8B

Metrics

Pass Rate, Win Rate, Parameter quality win rate, Reasoning coherence, Omission error rate

Datasets

ToolBench (ToolLLM) - 16,000+ REST APIs

Benchmarks

ToolBench, ToolEval protocol

