Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you deploy LLM agents in finance, you must audit each tool call for timeliness, intent, and domain; FinToolBench makes those audits practical and repeatable.
Summary TLDR
FinToolBench is a runnable, execution-grounded benchmark for financial LLM agents. It pairs 760 free-tier, executable financial tools with 295 tool-required questions and annotates every tool with three finance attributes: timeliness, intent type, and regulatory domain. The benchmark measures capability (tool invocation and execution) and compliance (timeliness, intent restraint, domain alignment) via new metrics (TIR, TESR, CER, TMR, IMR, DMR). The paper also provides FATR, a practical baseline that injects finance attributes into tool cards and stabilizes execution. With attribute injection, the agent invokes tools slightly less often, but the calls it does make succeed more often and show fewer compliance mismatches.
Problem Statement
Existing finance evaluations focus on static QA and ignore real tool execution. General tool benchmarks use toy or few financial APIs and miss finance-specific acceptability (timeliness, intent restraint, domain alignment). This makes it hard to audit whether an agent's tool calls are timely, non-transactional, and domain-appropriate, which matters in high-stakes finance.
Main Contribution
FinToolBench: a runnable benchmark with 760 free-tier executable financial tools and 295 tool-required questions producing full tool traces.
Finance-aware evaluation: call-level compliance metrics for timeliness (TMR), intent (IMR), and domain (DMR) plus capability metrics (TIR, TESR, CER, Soft Score, CSS).
FATR baseline: a pragmatic, model-agnostic finance-aware tool retrieval and planner wrapper that injects finance attributes and stabilizes execution (caching, retries, compression).
Open artifact: tool manifest, question set, and evaluation scripts published to reproduce runs.
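To picture the attribute injection that FATR performs, the sketch below renders a tool card with and without the three finance attributes. It is only an illustration: the field names, the label values, and the `get_daily_close` tool are hypothetical, and the benchmark's actual manifest schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCard:
    # Field names and label values are illustrative assumptions,
    # not the benchmark's actual schema.
    name: str
    description: str
    timeliness: str  # e.g. "real_time", "daily", "historical"
    intent: str      # e.g. "informational", "transactional"
    domain: str      # e.g. "equities", "fx", "macro"


def render_card(card: ToolCard, inject_attributes: bool = True) -> str:
    """Build the text a retriever or planner would see for one tool."""
    lines = [f"Tool: {card.name}", f"Description: {card.description}"]
    if inject_attributes:
        lines.append(f"Timeliness: {card.timeliness}")
        lines.append(f"Intent: {card.intent}")
        lines.append(f"Domain: {card.domain}")
    return "\n".join(lines)


card = ToolCard(
    name="get_daily_close",  # hypothetical tool
    description="Return the daily closing price for a ticker.",
    timeliness="daily",
    intent="informational",
    domain="equities",
)
print(render_card(card))
```

Keeping the attributes as plain text lines means the same card can feed both an embedding-based retriever and a planner prompt without extra plumbing.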
Key Findings
FinToolBench scales to a large, runnable inventory.
Different planners trade off invocation vs precision.
Finance-attribute injection changes behavior and improves conditional reliability.
Execution traces reveal failure modes not seen in static benchmarks.
Results
Tool Invocation Rate (TIR)
Conditional Execution Rate (CER)
Soft Score / CSS (answer quality)
Compliance mismatch rates (IMR, DMR, TMR)
Who Should Care
What To Try In 7 Days
Run FATR on 20 representative questions to see how your planner handles tool selection and compliance.
Annotate your internal tools with timeliness, intent, and domain tags and expose them in tool cards.
Log and inspect tool traces per query and compute TIR/CER and TMR/IMR/DMR to separate coverage from compliance.
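The last step above can be sketched as a small aggregation over logged traces. The definitions here are assumptions rather than the paper's exact formulas: TIR is read as the share of questions with at least one tool call, TESR as the share with at least one successful execution, CER as successful execution conditional on invocation, and TMR/IMR/DMR as per-call mismatch rates. The `Call` and `Trace` types are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    ok: bool                # did the tool execute successfully?
    timeliness_match: bool  # tool timeliness fits what the question needs
    intent_match: bool      # tool intent fits (e.g. no transactional call)
    domain_match: bool      # tool domain fits the question's domain


@dataclass
class Trace:
    calls: list[Call] = field(default_factory=list)


def rate(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def summarize(traces: list[Trace]) -> dict[str, float]:
    invoked = [t for t in traces if t.calls]
    executed = [t for t in invoked if any(c.ok for c in t.calls)]
    calls = [c for t in traces for c in t.calls]
    return {
        # Assumed question-level capability rates.
        "TIR": rate(len(invoked), len(traces)),
        "TESR": rate(len(executed), len(traces)),
        "CER": rate(len(executed), len(invoked)),
        # Assumed call-level mismatch rates.
        "TMR": rate(sum(not c.timeliness_match for c in calls), len(calls)),
        "IMR": rate(sum(not c.intent_match for c in calls), len(calls)),
        "DMR": rate(sum(not c.domain_match for c in calls), len(calls)),
    }


traces = [
    Trace(),  # the agent answered without calling any tool
    Trace([Call(ok=False, timeliness_match=True, intent_match=False, domain_match=True)]),
    Trace([Call(ok=True, timeliness_match=True, intent_match=True, domain_match=True)]),
    Trace([Call(ok=True, timeliness_match=True, intent_match=True, domain_match=True)]),
]
summary = summarize(traces)
print(summary)
```

Reporting CER alongside TIR is what separates coverage from reliability: a planner can call tools often and still fail most of the calls it makes.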
Agent Features
Memory
- short-term trace logging (no long-term retrieval memory)
Planning
- Constraint-aware planning (infer timeliness/intent/domain)
- ReAct loop capped at max_steps
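A step-capped ReAct loop might look roughly like the sketch below. The action format, the `plan_step` callable (a stand-in for the LLM planner), and the toy tool are all hypothetical; FATR's actual planner interface is not described here.

```python
from typing import Any, Callable

# Assumed action shape: {"type": "call", "tool": str, "args": dict}
# or {"type": "final", "answer": Any}.
Action = dict[str, Any]


def run_react(
    question: str,
    plan_step: Callable[[str, list[dict]], Action],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 5,
) -> tuple[Any, list[dict]]:
    """Alternate planning and tool calls until a final answer or the step cap."""
    trace: list[dict] = []
    for _ in range(max_steps):
        action = plan_step(question, trace)
        if action["type"] == "final":
            return action["answer"], trace
        step = {"tool": action["tool"], "args": action["args"]}
        try:
            step["observation"] = tools[action["tool"]](**action["args"])
            step["ok"] = True
        except Exception as exc:  # unknown tool, bad arguments, tool failure
            step["observation"] = repr(exc)
            step["ok"] = False
        trace.append(step)
    return None, trace  # cap reached without a final answer


def scripted_planner(question: str, trace: list[dict]) -> Action:
    # Stand-in for an LLM: call one tool, then answer from its observation.
    if not trace:
        return {"type": "call", "tool": "get_price", "args": {"ticker": "ABC"}}
    return {"type": "final", "answer": f"Price is {trace[-1]['observation']}"}


def looping_planner(question: str, trace: list[dict]) -> Action:
    # A planner that never finishes, to show the cap taking effect.
    return {"type": "call", "tool": "get_price", "args": {"ticker": "ABC"}}


tools = {"get_price": lambda ticker: 10.0}  # hypothetical tool
answer, trace = run_react("What is ABC's price?", scripted_planner, tools)
_, capped = run_react("What is ABC's price?", looping_planner, tools, max_steps=3)
print(answer, len(trace), len(capped))
```

Failed calls are recorded in the trace rather than raised, so the planner can react to them and the evaluator can still score the attempt.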
Tool Use
- Top-K retrieval (BGE-M3) of tool cards
- Function-style tool calling with JSON arguments
- Multi-tool chains and multi-step traces
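The retrieval and calling steps listed above can be illustrated with a toy example. Note the substitution: a bag-of-words count vector stands in for BGE-M3 dense embeddings so the sketch runs without a model download. The tool names and card texts are hypothetical.

```python
import json
import math
from collections import Counter
from typing import Any, Callable


def embed(text: str) -> Counter:
    # Stand-in for a dense encoder such as BGE-M3: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, cards: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k tool cards most similar to the query."""
    q = embed(query)
    ranked = sorted(cards, key=lambda name: cosine(q, embed(cards[name])), reverse=True)
    return ranked[:k]


def call_tool(tools: dict[str, Callable[..., Any]], name: str, json_args: str) -> Any:
    """Function-style call: decode the JSON arguments and pass them as keywords."""
    return tools[name](**json.loads(json_args))


cards = {  # hypothetical tool cards
    "get_daily_close": "daily closing price for a stock ticker",
    "get_fx_rate": "current exchange rate between two currencies",
    "get_cpi": "monthly consumer price index for a country",
}
tools = {"get_daily_close": lambda ticker: {"ticker": ticker, "close": 10.0}}

candidates = top_k("closing price of a stock", cards, k=2)
result = call_tool(tools, candidates[0], '{"ticker": "ABC"}')
print(candidates, result)
```

Restricting the planner to the top-K cards keeps the prompt small even when the inventory holds hundreds of tools.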
Frameworks
- FATR
- Tool manifest with normalized signatures
Is Agentic
true
Architectures
- ReAct-style planning
- retriever + planner + executor pipeline
Collaboration
- single-agent planning (no multi-agent coordination)
Optimization Features
Token Efficiency
- compress long tool outputs before returning to planner
Infra Optimization
- deterministic caching of tool outputs
System Optimization
- per-call timeout (60s) and max steps to limit latency
Inference Optimization
- output compression to limit context growth
- caching and retries to stabilize execution
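The caching, retry, and compression ideas above could be combined in one executor wrapper, sketched below under stated assumptions: the class name, the cache key scheme, and the retry count are hypothetical, and plain truncation stands in for whatever output compression FATR actually applies.

```python
import json
import time
from typing import Any, Callable


class StableExecutor:
    """Hypothetical wrapper: deterministic cache, bounded retries, short outputs."""

    def __init__(self, max_retries: int = 2, max_chars: int = 200, backoff_s: float = 0.0):
        self.cache: dict[str, str] = {}
        self.max_retries = max_retries
        self.max_chars = max_chars
        self.backoff_s = backoff_s

    def run(self, name: str, fn: Callable[..., Any], args: dict[str, Any]) -> str:
        # Sorted keys make the cache key independent of argument order.
        key = json.dumps({"tool": name, "args": args}, sort_keys=True)
        if key in self.cache:
            return self.cache[key]
        last_error: Exception | None = None
        for attempt in range(self.max_retries + 1):
            try:
                raw = fn(**args)
                break
            except Exception as exc:
                last_error = exc
                time.sleep(self.backoff_s * (attempt + 1))
        else:  # every attempt failed
            raise RuntimeError(f"{name} failed after {self.max_retries + 1} attempts") from last_error
        text = json.dumps(raw, sort_keys=True, default=str)
        if len(text) > self.max_chars:
            # Truncation as a simple stand-in for output compression.
            text = text[: self.max_chars] + "...[truncated]"
        self.cache[key] = text
        return text


calls = {"n": 0}


def flaky_quote(ticker: str) -> dict:
    # Hypothetical tool that fails once, then returns a long payload.
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("transient failure")
    return {"ticker": ticker, "history": list(range(100))}


executor = StableExecutor(max_retries=2, max_chars=80)
first = executor.run("flaky_quote", flaky_quote, {"ticker": "ABC"})
second = executor.run("flaky_quote", flaky_quote, {"ticker": "ABC"})  # served from cache
print(calls["n"], len(first))
```

Caching the compressed text rather than the raw payload means a repeated call returns exactly the same string, which helps keep benchmark runs comparable when endpoints are flaky.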
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on free-tier RapidAPI and AkShare; endpoints can change or require keys.
- Tool attributes and some labels are produced by an LLM and may be noisy.
- LLM-as-judge (GPT-5.1) introduces evaluation variance and potential bias.
When Not To Use
- High-frequency trading or systems that require proprietary real-time feeds not available via free-tier APIs.
- Legal or regulatory decisions where human sign-off is mandated and automated audits are insufficient.
Failure Modes
- Tool drift: endpoints change or return stale data.
- Argument instantiation errors causing low CER despite high invocation.
- Proxy-based numeric estimates: plausible traces that yield wrong numbers.
- Judge instability: LLM scoring variance can blur small differences.
Core Entities
Models
- Doubao-Seed-1.6
- Qwen3-8B
- GLM-4.7-Flash
- GPT-4o
- GPT-5.1 (judge)
- BGE-M3 (retriever)
Metrics
- TIR
- TESR
- CER
- Soft Score
- CSS
- TMR
- IMR
- DMR
Datasets
- FinToolBench
- FinanceBench
- OpenFinData
Benchmarks
- FinToolBench
- StableToolBench
- API-Bank
- AgentBench

