FinS-Pilot: a 316-query, user-driven benchmark that tests real-time financial RAG with live API data

May 31, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu

Links

Abstract / PDF

Why It Matters For Business

Financial assistants must combine live market APIs with text retrieval: without live data, numeric answers are wrong, and adding web search improves the quality of content answers.

Summary TLDR

FinS-Pilot is an open benchmark and dataset for financial retrieval-augmented generation (RAG). It is built from 316 real user queries from an online financial assistant, split into 104 numerical (time-sensitive) and 212 content queries. The benchmark injects real-time market data via the Tushare API and uses a dual retrieval setup (embedding-based text retrieval plus web search) to simulate production RAG. Experiments on six Chinese LLMs show that (1) a closed-source model (Xiaofa-1.0) reached 91.5% accuracy on content questions, (2) retrieval from web search (Bing) improves generation quality, and (3) LLMs without live data fail numeric tasks (for example, DeepSeek-v3 without references scored 0% on numerical queries).

Problem Statement

Existing financial benchmarks focus on static reports or synthetic queries and lack real-time data and real user intent. This leaves a gap for evaluating RAG systems that must combine live market feeds and text corpora for accurate financial assistance.

Main Contribution

A user-driven dataset of 316 real queries from a production financial assistant (104 numerical, 212 content) with manual gold answers.

A workflow-aware intent taxonomy: 9 top-level categories and 62 second-level intents aligned to business pipelines.

A hybrid retrieval design that integrates Tushare API (real-time numeric data) with embedding-based text retrieval and Bing web search.

An automated evaluation pipeline with comparisons across six Chinese LLMs; the dataset and code are published on GitHub.
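
The hybrid retrieval idea can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the keyword-based `is_numeric_query` check, the `Evidence` type, and the three injected callables (`market_api`, `embed_search`, `web_search`) are hypothetical stand-ins for an intent classifier, a Tushare client, an embedding index, and a Bing client. The routing rule (numeric queries go to the market API, content queries go to both text retrievers) is an assumption based on the summary above.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    source: str  # "market_api", "embedding", or "web"
    text: str


# Hypothetical keyword cue for time-sensitive numeric questions; a real
# system would more likely use an intent classifier over the taxonomy.
NUMERIC_CUES = re.compile(
    r"\b(price|volume|turnover|market cap|pe ratio)\b", re.IGNORECASE
)

Retriever = Callable[[str], List[str]]


def is_numeric_query(query: str) -> bool:
    """Return True when the query looks like a live-number lookup."""
    return bool(NUMERIC_CUES.search(query))


def gather_evidence(
    query: str,
    market_api: Callable[[str], str],
    embed_search: Retriever,
    web_search: Retriever,
    top_k: int = 3,
) -> List[Evidence]:
    """Route numeric queries to the market API, others to text retrieval."""
    if is_numeric_query(query):
        return [Evidence("market_api", market_api(query))]
    evidence = [Evidence("embedding", t) for t in embed_search(query)[:top_k]]
    evidence += [Evidence("web", t) for t in web_search(query)[:top_k]]
    return evidence


def build_prompt(query: str, evidence: List[Evidence]) -> str:
    """Format numbered references followed by the user question."""
    refs = "\n".join(
        f"[{i}] ({e.source}) {e.text}" for i, e in enumerate(evidence, 1)
    )
    return f"References:\n{refs}\n\nQuestion: {query}"
```

Passing the retrievers in as callables keeps the routing logic testable with stubs before any real API keys are wired up.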

Key Findings

Dataset composition: 316 real user queries covering both time-sensitive numbers and content questions.

Numbers: 316 queries (104 numerical, 212 content).

A closed-source model, Xiaofa-1.0, achieved the best content accuracy in this benchmark.

Numbers: Xiaofa-1.0 accuracy = 91.5% (content queries).

Web search retrieval (Bing) improved generation quality vs. embedding-only retrieval.

Numbers: Example: Doubao-1.5-pro ROUGE-L = 0.3469 (Bing) vs. ~0.0845 (base embedding).

LLMs without live references fail numeric tasks.

Numbers: DeepSeek-v3 without reference => 0% accuracy on 104 numerical queries.

Common failure modes: unit conversion errors and wrong time references.

Numbers: Error analysis in Section 3.2 cites unit conversion and temporal misinterpretation as primary failures.

Results

Accuracy

Value: Xiaofa-1.0: 91.5%; other models: 71%–83% (on 212 content queries)

Accuracy

Value: DeepSeek-v3 without reference: 0% on 104 numerical queries

ROUGE-L (retrieval impact example)

Value: Doubao-1.5-pro ROUGE-L = 0.3469 (Bing) vs. ~0.0845 (base embedding)

Baseline: Base embedding
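
For readers unfamiliar with the metric, ROUGE-L scores a generated answer by its longest common subsequence (LCS) with the reference answer. The sketch below computes an LCS-based F1 score; the exact variant the paper uses (F-measure weighting, tokenizer) is not stated in this summary, so the balanced F1 and the pluggable `tokenize` argument are assumptions. For Chinese text, character-level tokenization (`tokenize=list`) is one common choice.

```python
from typing import Callable, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(
    candidate: str,
    reference: str,
    tokenize: Callable[[str], List[str]] = str.split,
) -> float:
    """Balanced ROUGE-L F1 between a candidate and a reference answer."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Averaging `rouge_l_f1` over all content queries for each retrieval setting gives a per-setting score comparable in spirit to the figures above, though absolute values depend on tokenization.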

Who Should Care

What To Try In 7 Days

Hook a market API (e.g., Tushare) to your LLM pipeline and test 10 numeric queries.

Collect 2–4 weeks of real user logs and extract common intent templates into a small taxonomy.

Run an A/B test: embedding-only retriever vs adding web search (Bing) and compare a few ROUGE/accuracy metrics.
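
A small harness is enough for the first and third items. The sketch below scores a pipeline on numeric queries by extracting the first number in each answer and comparing it to a gold value within a relative tolerance. The `answer_fn` callable, the 1% default tolerance, and the first-number heuristic are all assumptions for illustration; the heuristic will misfire if the answer repeats a stock code or a date before the value.

```python
import math
import re
from typing import Callable, Iterable, Optional, Tuple

# Matches integers and decimals, with optional thousands separators.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_number(text: str) -> Optional[float]:
    """Return the first number found in the text, or None if there is none."""
    match = NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def numeric_accuracy(
    cases: Iterable[Tuple[str, float]],
    answer_fn: Callable[[str], str],
    rel_tol: float = 0.01,
) -> float:
    """Fraction of (query, gold) cases answered within the relative tolerance."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = 0
    for query, gold in cases:
        predicted = extract_number(answer_fn(query))
        if predicted is not None and math.isclose(predicted, gold, rel_tol=rel_tol):
            hits += 1
    return hits / len(cases)
```

For the A/B comparison, call `numeric_accuracy` (or an equivalent content metric) once per pipeline variant on the same cases and compare the two scores.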

Agent Features

Tool Use

  • external API calls (Tushare)
  • web search (Bing)
  • embedding retriever

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small scale: the dataset has only 316 queries, which limits statistical power.
  • Single-source logs risk user-distribution bias from one assistant.
  • Language and region: experiments and data are Chinese-focused.
  • Some top-performing model(s) are closed-source, limiting reproducibility.
  • Manual labeling is expensive and may not scale to larger testbeds.

When Not To Use

  • When you need large-scale, statistically robust benchmarks.
  • When evaluating multilingual or non-Chinese financial assistants.
  • When you require fully open-model-only comparisons.

Failure Modes

  • Unit conversion mistakes (e.g., percentages vs. basis points).
  • Temporal reference errors (using wrong timestamp or stale data).
  • Hallucinated facts when retrieval fails or is irrelevant.
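
The first two failure modes can be guarded against with simple checks before a value reaches the prompt. The sketch below is one possible approach, not something taken from the paper: the unit names in `UNIT_TO_FRACTION` and the 15-minute `max_age` default are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Conversion factors to a plain fraction (0.01 means 1%).
# The unit labels are hypothetical; adapt them to your data source.
UNIT_TO_FRACTION = {"fraction": 1.0, "percent": 1e-2, "bp": 1e-4}


def to_fraction(value: float, unit: str) -> float:
    """Normalize a rate to a plain fraction; raises KeyError on unknown units."""
    return value * UNIT_TO_FRACTION[unit]


def is_stale(
    as_of: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),
) -> bool:
    """Return True when a data point is older than the allowed age."""
    return now - as_of > max_age


# Usage: 25 basis points and 0.25 percent normalize to the same fraction.
quote_time = datetime(2025, 5, 30, 9, 30, tzinfo=timezone.utc)
check_time = datetime(2025, 5, 30, 10, 0, tzinfo=timezone.utc)
same_rate = abs(to_fraction(25, "bp") - to_fraction(0.25, "percent")) < 1e-12
stale = is_stale(quote_time, check_time)  # 30 minutes old, so stale here
```

Raising on an unknown unit is deliberate: silently passing an unlabeled value through is how percent and basis-point figures get mixed up.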

Core Entities

Models

  • DeepSeek-v3
  • DeepSeek-R1
  • Doubao-1.5-pro
  • Moonshot-v1
  • Baichuan4
  • Xiaofa-1.0

Metrics

  • Accuracy
  • ROUGE-L
  • BLEU
  • cosine similarity
  • hallucination
  • completeness
  • relevance

Datasets

  • FinS-Pilot

Benchmarks

  • FinanceBench
  • FinQA
  • FiQA

Context Entities

Models

  • LAMBDA
  • MMLU

Metrics

  • ROUGE
  • BLEU

Datasets

  • LAMBDA
  • MMLU

Benchmarks

  • LiveBench