Overview
Separating capability (tool invocation and execution) from compliance (timeliness, intent, and domain) produces clearer failure signals, which are useful for production audits.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
If you deploy LLM agents in finance, you must audit each tool call for timeliness, intent, and domain; FinToolBench makes those audits practical and repeatable.
Summary TLDR
FinToolBench is a runnable, execution-grounded benchmark for financial LLM agents. It pairs 760 free-tier, executable financial tools with 295 tool-required questions and annotates every tool with three finance attributes: timeliness, intent type, and regulatory domain. The benchmark measures capability (tool invocation and execution) and compliance (timeliness, intent restraint, domain alignment) via new metrics (TIR, TESR, CER, TMR, IMR, DMR). The paper also provides FATR, a practical baseline that injects finance attributes into tool cards, stabilizes execution, and shows attribute injection trades slightly lower invocation for better conditional success and fewer mismatches. Results show
Problem Statement
Existing finance evaluations focus on static QA and ignore real tool execution. General tool benchmarks use toy APIs or only a handful of financial APIs, and they miss finance-specific acceptability criteria (timeliness, intent restraint, domain alignment). This makes it hard to audit whether an agent's tool calls are timely, non-transactional, and domain-appropriate, which matters in high-stakes finance.
Main Contribution
FinToolBench: a runnable benchmark with 760 free-tier executable financial tools and 295 tool-required questions producing full tool traces.
Finance-aware evaluation: call-level compliance metrics for timeliness (TMR), intent (IMR), and domain (DMR) plus capability metrics (TIR, TESR, CER, Soft Score, CSS).
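The split between capability and compliance metrics can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: only the relation CER = TESR/TIR comes from the summary above. The `Call` and `Trace` records, the per-question definitions of TIR and TESR, and the reading of TMR, IMR, and DMR as per-call mismatch rates are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One tool call in a trace (hypothetical record layout)."""
    ok: bool                          # did the call execute successfully?
    timeliness_mismatch: bool = False
    intent_mismatch: bool = False
    domain_mismatch: bool = False


@dataclass
class Trace:
    """All tool calls made while answering one question."""
    question_id: str
    calls: list = field(default_factory=list)


def summarize(traces):
    """Compute capability and compliance rates under the assumed definitions."""
    n = len(traces)
    invoked = sum(1 for t in traces if t.calls)
    executed = sum(1 for t in traces if any(c.ok for c in t.calls))
    calls = [c for t in traces for c in t.calls]

    tir = invoked / n if n else 0.0        # assumed: share of questions with a call
    tesr = executed / n if n else 0.0      # assumed: share with a successful call
    cer = tesr / tir if tir else 0.0       # CER = TESR / TIR, as stated in the summary

    def rate(flag):
        # Assumed: call-level mismatch rate over all logged calls.
        return sum(getattr(c, flag) for c in calls) / len(calls) if calls else 0.0

    return {
        "TIR": tir, "TESR": tesr, "CER": cer,
        "TMR": rate("timeliness_mismatch"),
        "IMR": rate("intent_mismatch"),
        "DMR": rate("domain_mismatch"),
    }


traces = [
    Trace("q1", [Call(ok=True)]),
    Trace("q2", [Call(ok=False, intent_mismatch=True)]),
    Trace("q3", []),  # the planner never called a tool
    Trace("q4", [Call(ok=True), Call(ok=False, timeliness_mismatch=True)]),
]
report = summarize(traces)
print(report)
```

Keeping the capability rates at question level and the compliance rates at call level mirrors the "call-level compliance metrics" wording above; a real audit would substitute the paper's exact definitions.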
Key Findings
FinToolBench scales to a large, runnable inventory.
Different planners trade off invocation vs precision.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Tool Invocation Rate (TIR) | Qwen3-8B 0.8712; Doubao 0.6508; GLM 0.4407; GPT-4o 0.2267 | — | — | FinToolBench (295 questions) | Table 3 reports per-model TIR | Table 3 |
| Conditional Execution Rate (CER) | GPT-4o 0.6176; Doubao 0.5; Qwen3-8B 0.3385; GLM 0.4769 | — | — | FinToolBench | CER = TESR/TIR to decouple coverage and reliability | Sec.3.4, Table 3 |
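
Because the table reports TIR and CER but not TESR, the stated relation CER = TESR/TIR can be rearranged to recover an implied TESR per model. The sketch below does only that arithmetic on the values in the table; the derived TESR figures are a back-of-the-envelope calculation and are not quoted from the paper.

```python
# TIR and CER values copied from the Results table above.
tir = {"Qwen3-8B": 0.8712, "Doubao": 0.6508, "GLM": 0.4407, "GPT-4o": 0.2267}
cer = {"Qwen3-8B": 0.3385, "Doubao": 0.5, "GLM": 0.4769, "GPT-4o": 0.6176}


def implied_tesr(tir_by_model, cer_by_model):
    """Rearrange CER = TESR / TIR into TESR = CER * TIR for each model."""
    return {m: tir_by_model[m] * cer_by_model[m] for m in tir_by_model}


tesr = implied_tesr(tir, cer)
for model in sorted(tesr, key=tesr.get, reverse=True):
    print(f"{model}: TIR={tir[model]:.4f}  CER={cer[model]:.4f}  implied TESR={tesr[model]:.4f}")
```

The model that invokes tools most often (Qwen3-8B) is not the one with the highest conditional execution rate (GPT-4o), which is the coverage-versus-reliability trade-off that CER is meant to expose.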
What To Try In 7 Days
Run FATR on 20 representative questions to see how your planner handles tool selection and compliance.
Annotate your internal tools with timeliness, intent, and domain tags and expose them in tool cards.
Log and inspect tool traces per query and compute TIR/CER and TMR/IMR/DMR to separate coverage from compliance.
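For the annotation step above, a tool card with the three finance attributes might look like the following sketch. The `ToolCard` class, the specific label values (`realtime`, `informational`, `equities`), and the rendered text layout are hypothetical choices for illustration; the summary names only the three attribute types, not their vocabularies or the card format FATR uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCard:
    """Hypothetical tool card carrying the three finance attributes."""
    name: str
    description: str
    timeliness: str   # assumed labels, e.g. "realtime", "daily", "historical"
    intent: str       # assumed labels, e.g. "informational", "transactional"
    domain: str       # assumed labels, e.g. "equities", "fx", "macro"


def render_card(card: ToolCard) -> str:
    """Render the card as plain text that could be placed in a planner prompt."""
    return "\n".join([
        f"tool: {card.name}",
        f"description: {card.description}",
        f"timeliness: {card.timeliness}",
        f"intent: {card.intent}",
        f"domain: {card.domain}",
    ])


# Hypothetical internal tool, not one of the benchmark's tools.
card = ToolCard(
    name="get_stock_quote",
    description="Return the latest quote for a ticker symbol.",
    timeliness="realtime",
    intent="informational",
    domain="equities",
)
print(render_card(card))
```

Keeping the attributes as explicit fields, rather than burying them in the description, makes it straightforward to compare each logged call against the question's requirements when computing mismatch rates.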
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Relies on free-tier RapidAPI and AkShare; endpoints can change or require keys.
Tool attributes and some labels are produced by an LLM and may be noisy.
When Not To Use
High-frequency trading or systems that require proprietary real-time feeds not available via free-tier APIs.
Legal or regulatory decisions where human sign-off is mandated and automated audits are insufficient.
Failure Modes
Tool drift: endpoints change or return stale data.
Argument instantiation errors causing low CER despite high invocation.
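One way to catch the stale-data half of tool drift is to compare a response's as-of timestamp against a freshness budget tied to the tool's timeliness tag. The sketch below is an assumption-laden illustration: the `MAX_AGE` budgets, the tag names, and the `is_stale` helper are hypothetical and do not come from the paper.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets per timeliness tag; None means "never stale".
MAX_AGE = {
    "realtime": timedelta(minutes=1),
    "daily": timedelta(days=1),
    "historical": None,
}


def is_stale(as_of, timeliness, now=None):
    """Return True if data stamped `as_of` is older than the tag's budget."""
    now = now or datetime.now(timezone.utc)
    budget = MAX_AGE[timeliness]
    if budget is None:
        return False
    return now - as_of > budget


# Fixed clock so the example is deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
as_of = datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc)  # one hour old

print(is_stale(as_of, "realtime", now))    # older than the one-minute budget
print(is_stale(as_of, "daily", now))       # within the one-day budget
print(is_stale(as_of, "historical", now))  # no budget applies
```

Logging this flag alongside each call in the trace would let stale responses be separated from outright execution failures during an audit.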

