A runnable benchmark of 760 real financial tools and 295 tool-required questions for auditing LLM agents

March 9, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun

Links

Abstract / PDF

Why It Matters For Business

If you deploy LLM agents in finance, you must audit each tool call for timeliness, intent, and domain; FinToolBench makes those audits practical and repeatable.

Summary TLDR

FinToolBench is a runnable, execution-grounded benchmark for financial LLM agents. It pairs 760 free-tier, executable financial tools with 295 tool-required questions and annotates every tool with three finance attributes: timeliness, intent type, and regulatory domain. The benchmark measures capability (tool invocation and execution) and compliance (timeliness, intent restraint, domain alignment) via new metrics (TIR, TESR, CER, TMR, IMR, DMR). The paper also provides FATR, a practical baseline that injects finance attributes into tool cards and stabilizes execution. With FATR, the paper shows that attribute injection trades slightly lower invocation for better conditional success and fewer mismatches. Results show

Problem Statement

Existing finance evaluations focus on static QA and ignore real tool execution. General tool benchmarks rely on toy APIs or only a handful of financial ones, and they miss finance-specific acceptability criteria (timeliness, intent restraint, domain alignment). As a result, it is hard to audit whether an agent's tool calls are timely, non-transactional, and domain-appropriate, which matters in high-stakes finance.

Main Contribution

FinToolBench: a runnable benchmark with 760 free-tier executable financial tools and 295 tool-required questions that produce full tool traces.

Finance-aware evaluation: call-level compliance metrics for timeliness (TMR), intent (IMR), and domain (DMR) plus capability metrics (TIR, TESR, CER, Soft Score, CSS).

FATR baseline: a pragmatic, model-agnostic finance-aware tool retrieval and planner wrapper that injects finance attributes and stabilizes execution (caching, retries, compression).

Open artifact: tool manifest, question set, and evaluation scripts published to reproduce runs.
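
To make "injecting finance attributes into tool cards" concrete, here is a minimal sketch of what an annotated tool card might look like. The class name `ToolCard`, its fields, the label values ("daily", "informational", "equities"), and the rendered layout are all assumptions for illustration; the summary does not specify FATR's actual card schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCard:
    """Hypothetical tool card; field names and labels are illustrative only."""
    name: str
    description: str
    timeliness: str  # assumed label set, e.g. "real-time", "daily", "static"
    intent: str      # assumed label set, e.g. "informational", "transactional"
    domain: str      # assumed label set, e.g. "equities", "fx", "funds"

    def render(self, with_attributes: bool = True) -> str:
        """Render the text shown to the planner, with or without finance attributes."""
        lines = [f"Tool: {self.name}", f"Description: {self.description}"]
        if with_attributes:
            lines += [
                f"Timeliness: {self.timeliness}",
                f"Intent: {self.intent}",
                f"Domain: {self.domain}",
            ]
        return "\n".join(lines)


card = ToolCard(
    name="stock_daily_close",  # hypothetical tool name
    description="Daily closing prices for a ticker symbol.",
    timeliness="daily",
    intent="informational",
    domain="equities",
)
print(card.render())
```

Rendering the same card with `with_attributes=False` gives a plain baseline, which is one way to compare planner behavior with and without attribute injection.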

Key Findings

FinToolBench scales to a large, runnable inventory.

Numbers: 760 tools; 295 questions (166 single-tool, 129 multi-tool)

Different planners trade off invocation vs precision.

Numbers: Qwen3-8B TIR=0.8712, CER=0.3385; GPT-4o TIR=0.2267, CER=0.6176

Finance-attribute injection changes behavior and improves conditional reliability.

Numbers: Attribute injection reduces TIR slightly while increasing CER and reducing mismatch rates (Fig. 5)

Execution traces reveal failure modes not seen in static benchmarks.

Numbers: 103/295 runs had no tool call; 78 multi-tool runs (three-tool traces common) (Sec. 6.3)

Results

Tool Invocation Rate (TIR)

Value: Qwen3-8B 0.8712; Doubao 0.6508; GLM 0.4407; GPT-4o 0.2267

Conditional Execution Rate (CER)

Value: GPT-4o 0.6176; Doubao 0.5; Qwen3-8B 0.3385; GLM 0.4769

Soft Score / CSS (answer quality)

Value: Qwen3-8B Soft=0.4234; Doubao Soft=0.3958; GPT-4o CSS=0.67

Compliance mismatch rates (IMR, DMR, TMR)

Value (example): Qwen IMR=0.6887, DMR=0.1673; Doubao IMR=0.6563, DMR=0.1719; GPT-4o IMR=0.5, DMR=0.1176

Who Should Care

What To Try In 7 Days

Run FATR on 20 representative questions to see how your planner handles tool selection and compliance.

Annotate your internal tools with timeliness, intent, and domain tags and expose them in tool cards.

Log and inspect tool traces per query and compute TIR/CER and TMR/IMR/DMR to separate coverage from compliance.
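
The trace-based metrics in the last step can be sketched as follows. The definitions used here are assumptions, not the paper's exact formulas: TIR is taken as the share of runs with at least one tool call, CER as the share of invoking runs with at least one successful call, and TMR/IMR/DMR as per-call mismatch shares. The TIR reading is at least consistent with the reported figure of 103/295 runs without a tool call, since 192/295 ≈ 0.6508. The `ToolCall` and `Run` types are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolCall:
    ok: bool                           # did the call execute without error?
    timeliness_mismatch: bool = False  # flagged against the tool's timeliness tag
    intent_mismatch: bool = False      # flagged against the tool's intent tag
    domain_mismatch: bool = False      # flagged against the tool's domain tag


@dataclass
class Run:
    calls: List[ToolCall] = field(default_factory=list)


def _rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Compute capability and compliance rates under the assumed definitions."""
    invoked = [r for r in runs if r.calls]
    executed = [r for r in invoked if any(c.ok for c in r.calls)]
    calls = [c for r in runs for c in r.calls]
    return {
        "TIR": _rate(len(invoked), len(runs)),
        "CER": _rate(len(executed), len(invoked)),
        "TMR": _rate(sum(c.timeliness_mismatch for c in calls), len(calls)),
        "IMR": _rate(sum(c.intent_mismatch for c in calls), len(calls)),
        "DMR": _rate(sum(c.domain_mismatch for c in calls), len(calls)),
    }


runs = [
    Run(),                                                      # no tool call
    Run([ToolCall(ok=True, intent_mismatch=True)]),
    Run([ToolCall(ok=False)]),                                  # invoked, failed
    Run([ToolCall(ok=True), ToolCall(ok=True, domain_mismatch=True)]),
]
stats = summarize(runs)
print(stats)
```

Keeping the denominators separate (runs for TIR, invoking runs for CER, calls for the mismatch rates) is what lets coverage and compliance be read independently.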

Agent Features

Memory

  • short-term trace logging (no long-term retrieval memory)

Planning

  • Constraint-aware planning (infer timeliness/intent/domain)
  • ReAct loop capped at max_steps
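
A step-capped ReAct-style loop can be sketched in a few lines. This is a generic illustration, not FATR's implementation: `run_agent`, the action dictionary shape, and the scripted planner and executor stubs are all hypothetical stand-ins for an LLM planner and real tools.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Action = Dict[str, Any]
Trace = List[Dict[str, Any]]


def run_agent(
    plan_step: Callable[[str, Trace], Action],
    execute: Callable[[str, Dict[str, Any]], Any],
    question: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], Trace]:
    """Alternate planning and tool execution until a final answer or the step cap."""
    trace: Trace = []
    for _ in range(max_steps):
        action = plan_step(question, trace)
        if action["type"] == "final":
            return action["answer"], trace
        observation = execute(action["tool"], action["arguments"])
        trace.append(
            {
                "tool": action["tool"],
                "arguments": action["arguments"],
                "observation": observation,
            }
        )
    return None, trace  # step cap reached without a final answer


def scripted_planner(question: str, trace: Trace) -> Action:
    """Stub planner: call one tool, then answer from its observation."""
    if not trace:
        return {"type": "tool", "tool": "stock_quote", "arguments": {"symbol": "AAPL"}}
    return {"type": "final", "answer": f"Price: {trace[-1]['observation']}"}


def fake_execute(tool: str, arguments: Dict[str, Any]) -> Any:
    """Stub executor returning a fixed value instead of calling a real API."""
    return 123.45


answer, trace = run_agent(scripted_planner, fake_execute, "AAPL price?", max_steps=3)
print(answer, len(trace))
```

Returning the trace alongside the answer, even when the cap is hit, keeps every run auditable.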

Tool Use

  • Top-K retrieval (BGE-M3) of tool cards
  • Function-style tool calling with JSON arguments
  • Multi-tool chains and multi-step traces
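
The retrieval and calling steps above can be illustrated with a toy example. Note the substitution: the benchmark's retriever is BGE-M3, while this sketch uses a bag-of-words cosine similarity purely as a stand-in so that it runs without any model download. The tool names, card texts, and the `{"name", "arguments"}` call shape are assumptions for illustration.

```python
import json
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Stand-in embedding: token counts (NOT BGE-M3)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, cards: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k tool cards most similar to the query."""
    q = embed(query)
    ranked = sorted(cards, key=lambda name: cosine(q, embed(cards[name])), reverse=True)
    return ranked[:k]


# Hypothetical tool cards (name -> card text).
CARDS = {
    "stock_quote": "latest stock price quote for a ticker symbol",
    "fx_rate": "currency exchange rate between two currencies",
    "company_profile": "company profile sector and description for a ticker",
}

selected = top_k("what is the latest stock price for AAPL", CARDS, k=2)

# Function-style call with JSON arguments, as a planner might emit it.
call = json.dumps({"name": selected[0], "arguments": {"symbol": "AAPL"}})
print(selected, call)
```

Swapping `embed` for a real dense encoder would leave `top_k` unchanged, provided `cosine` is adapted to dense vectors.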

Frameworks

  • FATR
  • Tool manifest with normalized signatures

Is Agentic

true

Architectures

  • ReAct-style planning
  • retriever + planner + executor pipeline

Collaboration

  • single-agent planning (no multi-agent coordination)

Optimization Features

Token Efficiency

  • compress long tool outputs before returning to planner

Infra Optimization

  • deterministic caching of tool outputs

System Optimization

  • per-call timeout (60s) and max steps to limit latency
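
One way to enforce a per-call timeout is to run each tool in a worker thread and bound the wait. This is a sketch under assumptions, not the paper's mechanism: `call_with_timeout` and the result dictionary shape are hypothetical, and only the 60-second default mirrors the figure above. A caveat of this approach is that a timed-out thread keeps running in the background; the caller simply stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn: Callable[..., Any], timeout_s: float = 60.0, **kwargs: Any) -> Dict[str, Any]:
    """Run fn(**kwargs) in a worker thread and give up waiting after timeout_s seconds."""
    future = _pool.submit(fn, **kwargs)
    try:
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        future.cancel()  # no effect if the call already started running
        return {"ok": False, "error": f"timed out after {timeout_s}s"}


def slow_tool(symbol: str) -> str:
    """Stub tool that simulates a slow endpoint."""
    time.sleep(0.3)
    return f"quote for {symbol}"


def fast_tool(symbol: str) -> str:
    """Stub tool that returns immediately."""
    return f"quote for {symbol}"


timed_out = call_with_timeout(slow_tool, timeout_s=0.05, symbol="AAPL")
succeeded = call_with_timeout(fast_tool, timeout_s=1.0, symbol="AAPL")
print(timed_out, succeeded)
```

Returning a structured failure instead of raising lets the planner see the timeout as an observation and decide whether to retry or move on.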

Inference Optimization

  • output compression to limit context growth
  • caching and retries to stabilize execution
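
The stabilization features listed in this section (deterministic caching, retries, output compression) can be combined in a small wrapper. This is a minimal sketch, not FATR's code: `StableExecutor`, its parameters, and truncation as the compression strategy are assumptions, and the paper may compress outputs differently.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


def compress(output: Any, max_chars: int) -> str:
    """Serialize the output and truncate it so the planner context stays bounded."""
    text = output if isinstance(output, str) else json.dumps(output, sort_keys=True)
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"... [truncated {len(text) - max_chars} chars]"


class StableExecutor:
    """Hypothetical executor with a deterministic cache, retries, and compression."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], max_retries: int = 2,
                 backoff_s: float = 0.0, max_chars: int = 200) -> None:
        self.tools = tools
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.max_chars = max_chars
        self._cache: Dict[Tuple[str, str], str] = {}

    def call(self, name: str, arguments: Dict[str, Any]) -> str:
        # Sorting keys makes the cache key independent of argument order.
        key = (name, json.dumps(arguments, sort_keys=True))
        if key in self._cache:
            return self._cache[key]
        last_error = None
        for attempt in range(self.max_retries + 1):
            try:
                raw = self.tools[name](**arguments)
                break
            except Exception as exc:  # retry on any tool failure
                last_error = exc
                time.sleep(self.backoff_s * (attempt + 1))
        else:
            raise RuntimeError(f"{name} failed after {self.max_retries + 1} attempts") from last_error
        result = compress(raw, self.max_chars)
        self._cache[key] = result
        return result


attempts = {"n": 0}


def flaky_quote(symbol: str) -> Dict[str, Any]:
    """Stub tool that fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("transient failure")
    return {"symbol": symbol, "price": 123.45}


executor = StableExecutor({"quote": flaky_quote})
first = executor.call("quote", {"symbol": "AAPL"})
second = executor.call("quote", {"symbol": "AAPL"})  # served from the cache
print(first, attempts["n"])
```

Caching the compressed result rather than the raw output keeps repeated runs byte-identical, which helps when comparing traces across planners.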

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on free-tier RapidAPI and AkShare; endpoints can change or require keys.
  • Tool attributes and some labels are produced by an LLM and may be noisy.
  • LLM-as-judge (GPT-5.1) introduces evaluation variance and potential bias.

When Not To Use

  • High-frequency trading or systems that require proprietary real-time feeds not available via free-tier APIs.
  • Legal or regulatory decisions where human sign-off is mandated and automated audits are insufficient.

Failure Modes

  • Tool drift: endpoints change or return stale data.
  • Argument instantiation errors causing low CER despite high invocation.
  • Proxy-based numeric estimates: plausible traces that yield wrong numbers.
  • Judge instability: LLM scoring variance can blur small differences.

Core Entities

Models

  • Doubao-Seed-1.6
  • Qwen3-8B
  • GLM-4.7-Flash
  • GPT-4o
  • GPT-5.1 (judge)
  • BGE-M3 (retriever)

Metrics

  • TIR
  • TESR
  • CER
  • Soft Score
  • CSS
  • TMR
  • IMR
  • DMR

Datasets

  • FinToolBench
  • FinanceBench
  • OpenFinData

Benchmarks

  • FinToolBench
  • StableToolBench
  • API-Bank
  • AgentBench