Add an explicit 'think' reasoning field to function calls to improve parameter accuracy and explain decisions

January 26, 2026 · 6 min

Overview

Decision Snapshot: Ready For Pilot

TAFC (Think-Augmented Function Calling) can be deployed immediately with existing LLMs and APIs; evidence from ToolBench shows consistent, moderate improvements, with stronger gains on smaller models.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Lei Wei, Xiao Peng, Jinpeng Ou, Bin Wang

Why It Matters For Business

TAFC improves parameter accuracy and adds explainability for API calls without changing LLMs, reducing silent failures and easing debugging for tool-driven agents in production.

Summary TLDR

TAFC augments API function signatures with an explicit 'think' field so that LLMs explain how they chose each parameter. It triggers per-parameter reasoning only when needed, tunes tool descriptions automatically, and works with existing LLMs and APIs without model changes. On ToolBench (16k+ APIs), TAFC consistently raises Pass and Win Rates (≈+1.6–2.5% pass rate; larger relative gains for small models) and yields much higher judged parameter quality (TAFC wins ~69.6% vs. 18.2% for standard function calling).

Problem Statement

Current function-calling lacks per-parameter transparency. Models must pick multiple interdependent parameters without explicit internal justification, which makes debugging hard and increases errors on complex multi-parameter API calls.

Main Contribution

Introduce TAFC: add a structured 'think' parameter to function signatures so models output reasoning alongside parameter values.

Trigger per-parameter reasoning selectively via a complexity score that measures dependency, type complexity, and constraints.
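A minimal sketch of the signature augmentation described above, assuming tools are declared as JSON-Schema-style dictionaries. The helper name `add_think_field`, the example tool `search_flights`, and the wording of the field description are illustrative assumptions, not taken from the paper.

```python
import copy

# Hypothetical description text; the paper's exact wording is not given here.
THINK_DESCRIPTION = (
    "Explain briefly how each parameter value was chosen before filling it in."
)

def add_think_field(tool: dict) -> dict:
    """Return a copy of a tool definition with an extra 'think' string parameter."""
    augmented = copy.deepcopy(tool)  # leave the caller's definition untouched
    params = augmented.setdefault("parameters", {"type": "object"})
    props = params.get("properties", {})
    # Place 'think' first so the reasoning precedes the values (an assumption).
    params["properties"] = {
        "think": {"type": "string", "description": THINK_DESCRIPTION},
        **props,
    }
    return augmented

# Hypothetical multi-parameter tool used only for illustration.
search_flights = {
    "name": "search_flights",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "max_price": {"type": "number"},
        },
        "required": ["origin", "destination"],
    },
}

augmented_tool = add_think_field(search_flights)
```

The field is left optional in this sketch so the original `required` list, and therefore existing callers, are unaffected.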

Key Findings

TAFC improves Pass Rate across model sizes

Numbers: Pass Rate +1.6% to +2.5% (varies by model/size)

Practical Use: Expect a modest but consistent accuracy boost when calling multi-parameter APIs; benefits are larger on smaller models.

Evidence Ref: Table 1; Section 3.2

TAFC strongly improves judged parameter quality

Numbers: Average TAFC win 69.6% vs Standard 18.2%

Practical Use: When you need parameters that match human intent or constraints, TAFC yields much better parameter choices on evaluated benchmarks.

Evidence Ref: Table 2; Section 3.3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Average Pass Rate (small models) | ≈+2.4–2.5% absolute | Standard Function Calling | +2.4–2.5% | ToolBench (Llama-3.1-8B, Qwen2.5-7B) | Table 1 shows Llama-3.1-8B avg 27.3% → 29.7%; Qwen2.5-7B 28.7% → 31.2% | Table 1 |
| Average Win Rate improvement (small models) | ≈+2.9–3.1% absolute | Standard Function Calling | +2.9–3.1% | ToolBench (I1/I2/I3) | Section 3.2 reports Win Rate gains of 2.9–3.1% for small models | Section 3.2; Table 1 |
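The pass-rate deltas follow directly from the Table 1 averages quoted in the evidence column. The short check below recomputes the absolute gain and also derives the gain relative to baseline; the relative figure is a derived illustration, not a number reported in the source.

```python
# Average Pass Rates (%) quoted from Table 1: (standard function calling, TAFC).
pass_rates = {
    "Llama-3.1-8B": (27.3, 29.7),
    "Qwen2.5-7B": (28.7, 31.2),
}

def gains(baseline: float, tafc: float) -> tuple:
    """Return (absolute gain in points, gain relative to baseline in %)."""
    absolute = round(tafc - baseline, 1)
    relative = round(100 * (tafc - baseline) / baseline, 1)
    return absolute, relative

results = {model: gains(*rates) for model, rates in pass_rates.items()}
for model, (absolute, relative) in results.items():
    print(f"{model}: +{absolute} points absolute, +{relative}% relative")
```

On these inputs the absolute gains are 2.4 and 2.5 points, which corresponds to a relative improvement of just under 9% over each small model's baseline.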

What To Try In 7 Days

Add a 'think' string field to critical multi-parameter API signatures and log its content.

Set a complexity threshold (start τ=0.6) to only generate per-parameter reasoning when needed.

Run a small ToolBench-style test set to compare Pass/Win Rates before and after TAFC inside your agent stack.
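One possible way to prototype the threshold step is sketched below. The source states only that the complexity score measures dependency, type complexity, and constraints, and suggests starting at τ=0.6; the weighted-sum form, the weights, and the 0-to-1 feature scale used here are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ParamFeatures:
    """Hand-assigned features for one parameter, each in [0, 1] (assumed scale)."""
    dependency: float       # how much the value depends on other parameters
    type_complexity: float  # e.g. nested object vs. plain string
    constraints: float      # ranges, enums, formats, and similar restrictions

# Hypothetical weights; the paper's actual scoring function is not reproduced here.
WEIGHTS = (0.4, 0.3, 0.3)
TAU = 0.6  # starting threshold suggested above

def complexity(f: ParamFeatures) -> float:
    """Weighted sum of the three features (an assumed functional form)."""
    w_dep, w_type, w_con = WEIGHTS
    return w_dep * f.dependency + w_type * f.type_complexity + w_con * f.constraints

def needs_reasoning(f: ParamFeatures, tau: float = TAU) -> bool:
    """Request per-parameter reasoning only when the score reaches the threshold."""
    return complexity(f) >= tau

date_range = ParamFeatures(dependency=0.9, type_complexity=0.8, constraints=0.7)
city_name = ParamFeatures(dependency=0.1, type_complexity=0.0, constraints=0.2)
```

With these assumed weights, `date_range` would trigger reasoning and `city_name` would not, which matches the goal of spending tokens only on parameters that are hard to fill.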

Agent Features

Planning
Function Calling
Tool Use
Function Calling, Tool Selection
Frameworks
ReAct
Is Agentic

Yes

Optimization Features

Token Efficiency
Selective per-parameter reasoning to limit tokens
Infra Optimization
No model change, avoids retraining base models
System Optimization
Backwards-compatible signature augmentation and filtering at invocation
Training Optimization
Reasoning-guided tool description optimization (semantic, logic, action losses)
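The "filtering at invocation" idea can be sketched as a thin wrapper that removes the reasoning field before the real API is called, so the underlying function never sees it. The names `split_reasoning`, `invoke`, and `get_weather` are hypothetical, and logging the reasoning is one possible choice rather than something the source prescribes.

```python
import logging
from typing import Any, Callable, Dict, Optional, Tuple

logger = logging.getLogger("tafc")

def split_reasoning(arguments: Dict[str, Any]) -> Tuple[Optional[str], Dict[str, Any]]:
    """Separate the 'think' text from the arguments the real API expects."""
    clean = dict(arguments)              # copy so the model output is not mutated
    reasoning = clean.pop("think", None)  # absent when no reasoning was generated
    return reasoning, clean

def invoke(func: Callable[..., Any], arguments: Dict[str, Any]) -> Any:
    """Log the reasoning, then call the unmodified function without it."""
    reasoning, clean = split_reasoning(arguments)
    if reasoning:
        logger.info("reasoning for %s: %s", func.__name__, reasoning)
    return func(**clean)

# Hypothetical existing API that knows nothing about 'think'.
def get_weather(city: str, unit: str = "celsius") -> str:
    return f"{city}:{unit}"

model_output = {
    "think": "User is in France, so metric units are the sensible default.",
    "city": "Paris",
    "unit": "celsius",
}
result = invoke(get_weather, model_output)
```

Because the wrapper tolerates a missing `think` key, calls produced by standard function calling pass through unchanged.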

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ToolBench (ToolLLM) benchmark - ToolEval protocol; 16,000+ REST APIs

Risks & Boundaries

Limitations

Over-reasoning can hurt simple single-parameter functions; the paper notes that standard function calling outperforms TAFC in such cases

Adds token and latency overhead when reasoning is generated

When Not To Use

Simple single-parameter APIs where extra reasoning adds noise

Ultra-low-latency or strict token-budget environments

Failure Modes

Hallucinated reasoning can lead to incorrect parameter values

Over-reasoning increases error on trivial calls

Core Entities

Models

GPT-4o-0806, Claude-3.5-Sonnet, Qwen2.5-72B, Qwen2.5-32B, Qwen2.5-7B, Llama-3.1-70B, Llama-3.1-8B

Metrics

Pass Rate, Win Rate, Parameter quality win rate, Reasoning coherence, Omission error rate

Datasets

ToolBench (ToolLLM) - 16,000+ REST APIs

Benchmarks

ToolBench, ToolEval protocol