A runnable benchmark of 760 real financial tools and 295 tool-required questions for auditing LLM agents

Overview

Decision Snapshot: Ready For Pilot

Separating capability (invocation/execution) from compliance (timeliness/intent/domain) produces clearer failure signals useful for production audits.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun

Why It Matters For Business

If you deploy LLM agents in finance, you must audit each tool call for timeliness, intent, and domain; FinToolBench makes those audits practical and repeatable.

Who Should Care

Founder, CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

FinToolBench is a runnable, execution-grounded benchmark for financial LLM agents. It pairs 760 free-tier, executable financial tools with 295 tool-required questions and annotates every tool with three finance attributes: timeliness, intent type, and regulatory domain. The benchmark measures capability (tool invocation and execution) and compliance (timeliness, intent restraint, domain alignment) via new metrics (TIR, TESR, CER, TMR, IMR, DMR). The paper also provides FATR, a practical baseline that injects finance attributes into tool cards, stabilizes execution, and shows attribute injection trades slightly lower invocation for better conditional success and fewer mismatches. Results show

Problem Statement

Existing finance evaluations focus on static QA and ignore real tool execution. General tool benchmarks use toy or few financial APIs and miss finance-specific acceptability (timeliness, intent restraint, domain alignment). This makes it hard to audit whether an agent's tool calls are timely, non-transactional, and domain-appropriate, which matters in high-stakes finance.

Main Contribution

FinToolBench: a runnable benchmark with 760 free-tier executable financial tools and 295 tool-required questions producing full tool traces.

Finance-aware evaluation: call-level compliance metrics for timeliness (TMR), intent (IMR), and domain (DMR) plus capability metrics (TIR, TESR, CER, Soft Score, CSS).

Key Findings

FinToolBench scales to a large, runnable inventory.

Numbers: 760 tools; 295 questions (166 single-tool, 129 multi-tool)

Practical Use: You can evaluate agents on real tool execution and audit every tool call rather than only final answers.

Evidence Ref: Abstract, Sec. 3.3

Different planners trade off invocation vs precision.

Numbers: Qwen3-8B TIR=0.8712, CER=0.3385; GPT-4o TIR=0.2267, CER=0.6176

Practical Use: Aggressive tool use raises coverage but also execution errors; conservative agents call tools less but are more reliable when they do.

Evidence Ref: Table 3, Sec. 6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Tool Invocation Rate (TIR)	Qwen3-8B 0.8712; Doubao 0.6508; GLM 0.4407; GPT-4o 0.2267	—	—	FinToolBench (295 questions)	Table 3 reports per-model TIR	Table 3
Conditional Execution Rate (CER)	GPT-4o 0.6176; Doubao 0.5; Qwen3-8B 0.3385; GLM 0.4769	—	—	FinToolBench	CER = TESR/TIR to decouple coverage and reliability	Sec.3.4, Table 3
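Because the table defines CER = TESR/TIR, the TESR implied by each row can be recovered as TIR × CER. The sketch below does only that arithmetic on the values listed above. The resulting TESR figures are derived here, not quoted from the paper, and may differ from the paper's reported values by rounding.

```python
# TIR and CER exactly as listed in the Results table above.
reported = {
    "Qwen3-8B": {"tir": 0.8712, "cer": 0.3385},
    "Doubao": {"tir": 0.6508, "cer": 0.5},
    "GLM": {"tir": 0.4407, "cer": 0.4769},
    "GPT-4o": {"tir": 0.2267, "cer": 0.6176},
}


def implied_tesr(tir: float, cer: float) -> float:
    """Invert CER = TESR / TIR to obtain TESR = TIR * CER."""
    return tir * cer


# Derived values (not reported in this summary), rounded to 4 decimals.
implied = {model: round(implied_tesr(**row), 4) for model, row in reported.items()}

for model, tesr in implied.items():
    print(f"{model}: implied TESR = {tesr:.4f}")
```

Read this way, a high invocation rate and a low conditional rate can still multiply out to a higher end-to-end success rate than a cautious planner achieves, which is why the two numbers are worth reporting separately.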

What To Try In 7 Days

Run FATR on 20 representative questions to see how your planner handles tool selection and compliance.

Annotate your internal tools with timeliness, intent, and domain tags and expose them in tool cards.

Log and inspect tool traces per query and compute TIR/CER and TMR/IMR/DMR to separate coverage from compliance.
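As a starting point for the third step, here is a minimal sketch of computing these metrics from logged traces. The `Call` and `Trace` schemas are hypothetical, and the definitions are assumptions: TIR is taken as the share of questions with at least one tool call, TESR as the share with at least one successful call, CER as TESR/TIR, and TMR/IMR/DMR as the share of calls flagged as a timeliness, intent, or domain mismatch. Check the paper's Sec. 3.4 for the exact definitions before comparing against its tables.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    ok: bool                            # tool executed without error
    timeliness_mismatch: bool = False   # flags assumed to come from your own audit
    intent_mismatch: bool = False
    domain_mismatch: bool = False


@dataclass
class Trace:
    question_id: str
    calls: list = field(default_factory=list)


def audit(traces):
    """Aggregate capability and compliance rates over a list of traces."""
    n = len(traces)
    calls = [c for t in traces for c in t.calls]
    invoked = sum(1 for t in traces if t.calls)
    executed = sum(1 for t in traces if any(c.ok for c in t.calls))
    tir = invoked / n if n else 0.0
    tesr = executed / n if n else 0.0

    def call_rate(flag):
        # Share of all tool calls carrying the given mismatch flag.
        return sum(getattr(c, flag) for c in calls) / len(calls) if calls else 0.0

    return {
        "TIR": tir,
        "TESR": tesr,
        "CER": tesr / tir if tir else 0.0,
        "TMR": call_rate("timeliness_mismatch"),
        "IMR": call_rate("intent_mismatch"),
        "DMR": call_rate("domain_mismatch"),
    }


traces = [
    Trace("q1", [Call(ok=True)]),
    Trace("q2", [Call(ok=False, timeliness_mismatch=True)]),
    Trace("q3", []),  # the agent answered without calling any tool
    Trace("q4", [Call(ok=True, intent_mismatch=True), Call(ok=True)]),
]
report = audit(traces)
print(report)
```

Keeping the per-call flags in the trace, rather than only the aggregate, lets you drill from a high mismatch rate back to the specific calls that caused it.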

Agent Features

Memory

short-term trace logging (no long-term retrieval memory)

Planning

Constraint-aware planning (infer timeliness/intent/domain)

ReAct loop capped at max_steps
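A step-capped ReAct-style loop can be sketched as follows. This is an illustration, not FATR's implementation: the `planner` callable, the decision dictionary shape, and the `get_quote` tool are all hypothetical stand-ins, and a real planner would be an LLM call.

```python
def run_agent(question, planner, tools, max_steps=5):
    """Alternate planning and tool execution until the planner finishes
    or the step cap is reached. Returns (answer, trace)."""
    trace = []
    for _ in range(max_steps):
        decision = planner(question, trace)
        if decision["action"] == "finish":
            return decision["answer"], trace
        try:
            observation = tools[decision["tool"]](**decision["args"])
            ok = True
        except Exception as exc:  # record the failure instead of crashing the run
            observation, ok = repr(exc), False
        trace.append({"tool": decision["tool"], "args": decision["args"],
                      "ok": ok, "observation": observation})
    return None, trace  # cap reached without a final answer


def get_quote(symbol):
    # Hypothetical tool returning canned data for the demo.
    return {"symbol": symbol, "price": 101.5}


def scripted_planner(question, trace):
    # Stand-in for an LLM: call one tool, then answer from its observation.
    if not trace:
        return {"action": "call", "tool": "get_quote", "args": {"symbol": "ABC"}}
    return {"action": "finish", "answer": trace[-1]["observation"]["price"]}


def looping_planner(question, trace):
    # A planner that never finishes, to show the cap taking effect.
    return {"action": "call", "tool": "get_quote", "args": {"symbol": "ABC"}}


tools = {"get_quote": get_quote}
answer, trace = run_agent("Price of ABC?", scripted_planner, tools)
capped_answer, capped_trace = run_agent("Price of ABC?", looping_planner, tools, max_steps=3)
print(answer, len(trace), capped_answer, len(capped_trace))
```

The cap bounds both latency and cost per question, and the returned trace is what the call-level compliance audit consumes.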

Tool Use

Top-K retrieval (BGE-M3) of tool cards

Function-style tool calling with JSON arguments

Multi-tool chains and multi-step traces
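Top-K retrieval over tool cards amounts to ranking card embeddings by similarity to the question embedding. The sketch below uses cosine similarity over tiny hand-written vectors as placeholders; in the described pipeline the vectors would come from BGE-M3, which is not called here. Tool names and vector values are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, card_vecs, k=2):
    """Return the names of the k tool cards most similar to the query."""
    ranked = sorted(card_vecs, key=lambda name: cosine(query_vec, card_vecs[name]),
                    reverse=True)
    return ranked[:k]


# Placeholder embeddings (hypothetical); a real system would embed card text.
card_vecs = {
    "stock_quote": [0.9, 0.1, 0.0],
    "fx_rate": [0.1, 0.9, 0.0],
    "company_filings": [0.2, 0.0, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

selected = top_k(query_vec, card_vecs, k=2)
print(selected)
```

Only the selected cards are shown to the planner, so K trades recall of the right tool against prompt length.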

Frameworks

FATR

Tool manifest with normalized signatures
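One way to picture a manifest entry that carries the three finance attributes is a small record rendered into the text the planner sees. The field names, the attribute vocabulary ("real_time", "informational", "equities"), and the `stock_quote` tool below are hypothetical; the benchmark's actual manifest schema may differ.

```python
from dataclasses import dataclass


@dataclass
class ToolCard:
    name: str
    description: str
    parameters: dict         # parameter name -> type name (normalized signature)
    timeliness: str          # how fresh the returned data is (assumed vocabulary)
    intent_type: str         # e.g. informational vs transactional (assumed)
    regulatory_domain: str   # market or regulatory area covered (assumed)


def render_card(card, with_attributes=True):
    """Render a card as plain text, optionally injecting the finance attributes."""
    params = ", ".join(f"{name}: {typ}" for name, typ in card.parameters.items())
    lines = [f"{card.name}({params})", card.description]
    if with_attributes:
        lines += [
            f"timeliness: {card.timeliness}",
            f"intent: {card.intent_type}",
            f"domain: {card.regulatory_domain}",
        ]
    return "\n".join(lines)


card = ToolCard(
    name="stock_quote",
    description="Return the latest quote for a ticker symbol.",
    parameters={"symbol": "string"},
    timeliness="real_time",
    intent_type="informational",
    regulatory_domain="equities",
)
print(render_card(card))
```

Rendering with `with_attributes=False` gives a plain card, which makes it easy to compare planner behavior with and without attribute injection.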

Is Agentic

Yes

Architectures

ReAct-style planningretriever + planner + executor pipeline

Collaboration

single-agent planning (no multi-agent coordination)

Optimization Features

Token Efficiency

compress long tool outputs before returning to planner

Infra Optimization

deterministic caching of tool outputs
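Deterministic caching can be sketched by keying each call on the tool name plus a canonical serialization of its JSON arguments, so the same logical call always maps to the same key. The `CachedExecutor` class and the `fake_fx_rate` tool are hypothetical; the benchmark's own cache layout is not described in this summary.

```python
import hashlib
import json


def cache_key(tool_name, args):
    """Stable key: sorted keys and fixed separators make the JSON canonical."""
    payload = json.dumps({"tool": tool_name, "args": args},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedExecutor:
    def __init__(self, tools):
        self.tools = tools
        self.cache = {}
        self.misses = 0  # number of real tool executions

    def call(self, tool_name, args):
        key = cache_key(tool_name, args)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.tools[tool_name](**args)
        return self.cache[key]


def fake_fx_rate(base, quote):
    # Hypothetical tool with canned output for the demo.
    return {"pair": f"{base}/{quote}", "rate": 1.1}


executor = CachedExecutor({"fake_fx_rate": fake_fx_rate})
first = executor.call("fake_fx_rate", {"base": "EUR", "quote": "USD"})
second = executor.call("fake_fx_rate", {"quote": "USD", "base": "EUR"})  # same call, keys reordered
print(first == second, executor.misses)
```

Replaying cached outputs keeps repeated evaluation runs comparable even when a live endpoint would have returned different data.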

System Optimization

per-call timeout (60s) and max steps to limit latency
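A per-call timeout can be approximated in Python by running the tool in a worker thread and bounding how long the caller waits. This is a sketch under that assumption, not the benchmark's executor: `call_with_timeout` and `slow_tool` are hypothetical names, and a timed-out thread is abandoned rather than killed.

```python
import concurrent.futures as cf
import time


def call_with_timeout(tool, args, timeout_s=60.0):
    """Run tool(**args) and give up waiting after timeout_s seconds."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool, **args)
    try:
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except cf.TimeoutError:
        return {"ok": False, "error": "timeout"}
    except Exception as exc:  # the tool itself raised
        return {"ok": False, "error": repr(exc)}
    finally:
        pool.shutdown(wait=False)  # do not block on a still-running call


def slow_tool(delay_s):
    # Hypothetical tool that simulates a slow endpoint.
    time.sleep(delay_s)
    return "done"


fast = call_with_timeout(slow_tool, {"delay_s": 0.0}, timeout_s=1.0)
slow = call_with_timeout(slow_tool, {"delay_s": 0.3}, timeout_s=0.05)
print(fast, slow)
```

Returning a structured failure instead of raising lets the trace record the timeout as an execution error for that call.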

Inference Optimization

output compression to limit context growth

caching and retries to stabilize execution
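The two ideas can be combined in a thin wrapper around each tool call. In this sketch, plain truncation stands in for output compression, since this summary does not say how outputs are compressed, and the retry count, backoff, and `flaky_tool` are hypothetical.

```python
import time


def compress(text, max_chars=200):
    """Truncate long output and note how much was dropped (a stand-in for compression)."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f" ...[{omitted} chars omitted]"


def call_with_retries(tool, args, attempts=3, backoff_s=0.0):
    """Retry a failing tool call, returning compressed output on success."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            output = compress(str(tool(**args)))
            return {"ok": True, "attempts": attempt, "output": output}
        except Exception as exc:
            last_error = repr(exc)
            time.sleep(backoff_s * attempt)  # linear backoff; zero in this demo
    return {"ok": False, "attempts": attempts, "error": last_error}


state = {"calls": 0}


def flaky_tool():
    # Hypothetical tool: fails once, then returns a long payload.
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("transient failure")
    return "x" * 500


result = call_with_retries(flaky_tool, {})
print(result["ok"], result["attempts"], len(result["output"]))
```

The omission marker tells the planner that data was cut, which is safer than silently shortening a tool response.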

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Double-wk/FinToolBench.git

Data URLs

https://github.com/Double-wk/FinToolBench.git

Risks & Boundaries

Limitations

Relies on free-tier RapidAPI and AkShare; endpoints can change or require keys.

Tool attributes and some labels are produced by an LLM and may be noisy.

When Not To Use

High-frequency trading or systems that require proprietary real-time feeds not available via free-tier APIs.

Legal or regulatory decisions where human sign-off is mandated and automated audits are insufficient.

Failure Modes

Tool drift: endpoints change or return stale data.

Argument instantiation errors causing low CER despite high invocation.

Core Entities

Models

Doubao-Seed-1.6, Qwen3-8B, GLM-4.7-Flash, GPT-4o, GPT-5.1 (judge), BGE-M3 (retriever)

Metrics

TIR, TESR, CER, Soft Score, CSS, TMR, IMR, DMR

Datasets

FinToolBench, FinanceBench, OpenFinData

Benchmarks

FinToolBench, StableToolBench, API-Bank, AgentBench
