A runnable benchmark of 760 real financial tools and 295 tool-required questions for auditing LLM agents

March 9, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Separating capability (invocation/execution) from compliance (timeliness/intent/domain) produces clearer failure signals useful for production audits.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy LLM agents in finance, you must audit each tool call for timeliness, intent, and domain; FinToolBench makes those audits practical and repeatable.

Who Should Care

Summary TLDR

FinToolBench is a runnable, execution-grounded benchmark for financial LLM agents. It pairs 760 free-tier, executable financial tools with 295 tool-required questions and annotates every tool with three finance attributes: timeliness, intent type, and regulatory domain. The benchmark measures capability (tool invocation and execution) and compliance (timeliness, intent restraint, domain alignment) via new metrics (TIR, TESR, CER, TMR, IMR, DMR). The paper also provides FATR, a practical baseline that injects finance attributes into tool cards and stabilizes execution; it shows that attribute injection trades slightly lower invocation for better conditional success and fewer mismatches. Results show

Problem Statement

Existing finance evaluations focus on static QA and ignore real tool execution. General tool benchmarks rely on toy APIs or only a handful of financial ones, and they miss finance-specific acceptability criteria (timeliness, intent restraint, domain alignment). As a result, it is hard to audit whether an agent's tool calls are timely, non-transactional, and domain-appropriate, which matters in high-stakes finance.

Main Contribution

FinToolBench: a runnable benchmark with 760 free-tier executable financial tools and 295 tool-required questions producing full tool traces.

Finance-aware evaluation: call-level compliance metrics for timeliness (TMR), intent (IMR), and domain (DMR) plus capability metrics (TIR, TESR, CER, Soft Score, CSS).

Key Findings

FinToolBench scales to a large, runnable inventory.

Numbers: 760 tools; 295 questions (166 single-tool, 129 multi-tool)

Practical Use: You can evaluate agents on real tool execution and audit every tool call rather than only final answers.

Evidence Ref: Abstract, Sec. 3.3

Different planners trade off invocation vs precision.

Numbers: Qwen3-8B TIR=0.8712, CER=0.3385; GPT-4o TIR=0.2267, CER=0.6176

Practical Use: Aggressive tool use raises coverage but also execution errors; conservative agents call tools less but are more reliable when they do.

Evidence Ref: Table 3, Sec. 6.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Tool Invocation Rate (TIR) | Qwen3-8B 0.8712; Doubao 0.6508; GLM 0.4407; GPT-4o 0.2267 | - | - | FinToolBench (295 questions) | Table 3 reports per-model TIR | Table 3
Conditional Execution Rate (CER) | GPT-4o 0.6176; Doubao 0.5; Qwen3-8B 0.3385; GLM 0.4769 | - | - | FinToolBench | CER = TESR/TIR to decouple coverage and reliability | Sec. 3.4, Table 3
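
Since CER is given as TESR/TIR, the TESR implied by each Results row can be recovered as TIR × CER. The snippet below is a small arithmetic sketch that uses only the TIR and CER values listed above; the function names are illustrative rather than taken from the paper, the zero-invocation convention is an assumption, and the products are derived values, not figures quoted from Table 3.

```python
# TIR and CER values copied from the Results rows above.
TIR = {"Qwen3-8B": 0.8712, "Doubao": 0.6508, "GLM": 0.4407, "GPT-4o": 0.2267}
CER = {"Qwen3-8B": 0.3385, "Doubao": 0.5, "GLM": 0.4769, "GPT-4o": 0.6176}


def conditional_execution_rate(tesr: float, tir: float) -> float:
    """CER = TESR / TIR. Returning 0.0 when TIR is 0 is an assumed convention."""
    return tesr / tir if tir > 0 else 0.0


def implied_tesr(tir: float, cer: float) -> float:
    """Invert the identity above: TESR = TIR * CER."""
    return tir * cer


# Derived values, rounded to four decimals for display.
implied = {model: round(implied_tesr(TIR[model], CER[model]), 4) for model in TIR}

for model, tesr in sorted(implied.items(), key=lambda kv: -kv[1]):
    print(f"{model}: implied TESR ~ {tesr:.4f}")
```

Under this identity, a model that invokes tools less often can still end up with a comparable or higher implied TESR, which is why reading TIR and CER together is more informative than either one alone.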

What To Try In 7 Days

Run FATR on 20 representative questions to see how your planner handles tool selection and compliance.

Annotate your internal tools with timeliness, intent, and domain tags and expose them in tool cards.

Log and inspect tool traces per query and compute TIR/CER and TMR/IMR/DMR to separate coverage from compliance.
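
The steps above can be sketched as a small audit script. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: the `ToolCard`, `ToolCall`, and `Requirement` types, the attribute labels, and the tool names are all hypothetical; TIR and TESR are assumed to be question-level rates; and TMR, IMR, and DMR are assumed to be call-level mismatch rates against per-question requirements. Only CER = TESR/TIR comes directly from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCard:
    name: str
    timeliness: str  # hypothetical labels, e.g. "realtime", "daily"
    intent: str      # hypothetical labels, e.g. "informational", "transactional"
    domain: str      # hypothetical labels, e.g. "equities", "funds"


@dataclass(frozen=True)
class Requirement:
    timeliness: str
    intent: str
    domain: str


@dataclass(frozen=True)
class ToolCall:
    question_id: str
    tool: ToolCard
    executed_ok: bool


def audit(calls, requirements, n_questions):
    """Compute capability and compliance rates from logged tool calls."""
    invoked = {c.question_id for c in calls}
    succeeded = {c.question_id for c in calls if c.executed_ok}
    tir = len(invoked) / n_questions
    tesr = len(succeeded) / n_questions
    cer = tesr / tir if tir else 0.0

    def mismatch_rate(attr):
        # Share of calls whose tool attribute differs from the question's requirement.
        if not calls:
            return 0.0
        bad = sum(
            getattr(c.tool, attr) != getattr(requirements[c.question_id], attr)
            for c in calls
        )
        return bad / len(calls)

    return {
        "TIR": tir, "TESR": tesr, "CER": cer,
        "TMR": mismatch_rate("timeliness"),
        "IMR": mismatch_rate("intent"),
        "DMR": mismatch_rate("domain"),
    }


quote = ToolCard("stock_quote", "realtime", "informational", "equities")
order = ToolCard("place_order", "realtime", "transactional", "equities")

requirements = {
    "q1": Requirement("realtime", "informational", "equities"),
    "q2": Requirement("daily", "informational", "funds"),
}
calls = [
    ToolCall("q1", quote, True),   # compliant and successful
    ToolCall("q1", order, False),  # transactional tool for an informational question
    ToolCall("q2", quote, False),  # wrong timeliness and wrong domain
]

report = audit(calls, requirements, n_questions=4)
print(report)
```

Keeping the attribute tags on the tool card itself means the same log can answer both questions: did the agent reach a working tool, and was that tool acceptable for the request.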

Agent Features

Memory
short-term trace logging (no long-term retrieval memory)
Planning
Constraint-aware planning (infer timeliness/intent/domain); ReAct loop capped at max_steps
Tool Use
Top-K retrieval (BGE-M3) of tool cards; function-style tool calling with JSON arguments; multi-tool chains and multi-step traces
Frameworks
FATR; tool manifest with normalized signatures
Is Agentic

Yes

Architectures
ReAct-style planning; retriever + planner + executor pipeline
Collaboration
single-agent planning (no multi-agent coordination)
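
The retriever + planner + executor pipeline listed above can be sketched as a capped loop. This is a toy illustration, not FATR itself: word overlap stands in for BGE-M3 embedding retrieval, a scripted function stands in for the LLM planner, and every name here (`retrieve_top_k`, `run_agent`, `get_quote`, the card fields) is hypothetical.

```python
import json


def retrieve_top_k(question, tool_cards, k=2):
    """Rank tool cards by word overlap with the question (stand-in for BGE-M3)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        tool_cards,
        key=lambda card: -len(q_words & set(card["description"].lower().split())),
    )
    return ranked[:k]


def run_agent(question, tool_cards, tools, planner, max_steps=4, k=2):
    """ReAct-style loop: plan, call one tool, record the observation, repeat."""
    candidates = retrieve_top_k(question, tool_cards, k)
    trace = []
    for _ in range(max_steps):
        action = planner(question, candidates, trace)
        if action["type"] == "final":
            return action["answer"], trace
        step = {"tool": action["tool"], "arguments": action["arguments"]}
        try:
            # Function-style call: arguments arrive as a JSON string.
            args = json.loads(action["arguments"])
            step.update(ok=True, output=tools[action["tool"]](**args))
        except Exception as exc:  # bad JSON, unknown tool, or tool failure
            step.update(ok=False, output=repr(exc))
        trace.append(step)
    return None, trace  # step budget exhausted without a final answer


def scripted_planner(question, candidates, trace):
    """Stand-in for an LLM planner: one call, then answer from the observation."""
    if not trace:
        return {"type": "call", "tool": candidates[0]["name"],
                "arguments": json.dumps({"symbol": "ACME"})}
    last = trace[-1]
    if not last["ok"]:
        return {"type": "final", "answer": "unavailable"}
    return {"type": "final", "answer": f"ACME trades at {last['output']['price']}"}


tool_cards = [
    {"name": "get_quote", "description": "latest price quote for a stock symbol"},
    {"name": "get_fund_nav", "description": "daily net asset value for a fund"},
]
tools = {
    "get_quote": lambda symbol: {"symbol": symbol, "price": 101.5},
    "get_fund_nav": lambda code: {"code": code, "nav": 1.02},
}

question = "What is the latest price for stock ACME"
answer, trace = run_agent(question, tool_cards, tools, scripted_planner)
print(answer, trace)
```

Recording failed calls in the trace instead of raising keeps the full sequence available for the call-level audit.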

Optimization Features

Token Efficiency
compress long tool outputs before returning to planner
Infra Optimization
deterministic caching of tool outputs
System Optimization
per-call timeout (60s) and max steps to limit latency
Inference Optimization
output compression to limit context growth; caching and retries to stabilize execution
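
The stabilization features listed above (deterministic caching, a 60-second per-call timeout, retries, and output compression) could be combined in a wrapper along these lines. This is a hedged sketch rather than the paper's executor: `StableExecutor` and its parameters are hypothetical, truncation is a crude stand-in for whatever compression the authors use, and caching only successful results is a design choice made here.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class StableExecutor:
    def __init__(self, timeout_s=60.0, max_retries=2, max_chars=2000):
        self.timeout_s = timeout_s
        self.max_retries = max_retries
        self.max_chars = max_chars
        self._cache = {}
        self._pool = ThreadPoolExecutor(max_workers=4)

    def call(self, name, fn, args):
        # Deterministic cache key: tool name plus canonical JSON of the arguments.
        key = (name, json.dumps(args, sort_keys=True))
        if key in self._cache:
            return self._cache[key]
        last_error = None
        for _ in range(self.max_retries + 1):
            future = self._pool.submit(fn, **args)
            try:
                # Note: a timed-out worker thread is abandoned, not killed.
                output = future.result(timeout=self.timeout_s)
            except FutureTimeout:
                last_error = f"timeout after {self.timeout_s}s"
                continue
            except Exception as exc:
                last_error = repr(exc)
                continue
            result = {"ok": True, "output": self._compress(output)}
            self._cache[key] = result  # only successful calls are cached
            return result
        return {"ok": False, "output": last_error}

    def _compress(self, output):
        # Serialize and truncate so long tool outputs do not flood the planner context.
        text = json.dumps(output, sort_keys=True, default=str)
        if len(text) <= self.max_chars:
            return text
        return text[: self.max_chars] + "...[truncated]"


attempts = {"n": 0}


def flaky_quote(symbol):
    """Hypothetical tool that fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("transient failure")
    return {"symbol": symbol, "price": 101.5}


executor = StableExecutor(timeout_s=60.0, max_retries=2, max_chars=200)
first = executor.call("get_quote", flaky_quote, {"symbol": "ACME"})
second = executor.call("get_quote", flaky_quote, {"symbol": "ACME"})  # served from cache
print(first, attempts["n"])
```

Sorting keys before hashing the arguments makes two calls with the same arguments in a different order hit the same cache entry, which helps keep repeated benchmark runs comparable.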

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on free-tier RapidAPI and AkShare; endpoints can change or require keys.

Tool attributes and some labels are produced by an LLM and may be noisy.

When Not To Use

High-frequency trading or systems that require proprietary real-time feeds not available via free-tier APIs.

Legal or regulatory decisions where human sign-off is mandated and automated audits are insufficient.

Failure Modes

Tool drift: endpoints change or return stale data.

Argument instantiation errors: malformed or missing arguments cause low CER despite high invocation.

Core Entities

Models

Doubao-Seed-1.6; Qwen3-8B; GLM-4.7-Flash; GPT-4o; GPT-5.1 (judge); BGE-M3 (retriever)

Metrics

TIR; TESR; CER; Soft Score; CSS; TMR; IMR; DMR

Datasets

FinToolBench; FinanceBench; OpenFinData

Benchmarks

FinToolBench; StableToolBench; API-Bank; AgentBench