FinS-Pilot: a 316-query, user-driven benchmark that tests real-time financial RAG with live API data

May 31, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and focused on live-data RAG for Chinese finance; small size and single-source logs limit generality.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Financial assistants must combine live market APIs with text retrieval. In the reported results, models without live data answered numerical queries incorrectly, and adding web search improved the quality of content answers.

Who Should Care

Summary TLDR

FinS-Pilot is an open benchmark and dataset for financial retrieval-augmented generation (RAG). It is built from 316 real user queries from an online financial assistant, split into 104 numerical (time-sensitive) and 212 content queries. The benchmark injects real-time market data via the Tushare API and uses a dual retrieval setup (embedding-based text + web search) to simulate production RAG. Experiments on six Chinese LLMs show (1) a closed-source model (Xiaofa-1.0) reached 91.5% accuracy on content questions, (2) retrieval from web search (Bing) improves generation quality, and (3) LLMs without live data fail numeric tasks (example: DeepSeek-v3 w/o reference = 0% on numerical queries).
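To make the pipeline described in the summary concrete, here is a minimal sketch of how a live quote and passages from two retrievers might be combined into one prompt. It is not the authors' implementation. The `Passage` type, `merge_passages`, `build_prompt`, the score scale, and the sample quote values are all hypothetical; in a real setup the quote dictionary would come from a market-data call (for example a Tushare wrapper) and the passages from an embedding index and a web search client.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Passage:
    source: str   # "embedding" or "web" (hypothetical labels)
    text: str
    score: float  # higher is better; assumes both retrievers use comparable scales


def merge_passages(embedding_hits: List[Passage],
                   web_hits: List[Passage],
                   k: int = 3) -> List[Passage]:
    """Merge hits from both retrievers, drop duplicate texts, keep the top k by score."""
    seen = set()
    merged = []
    for p in sorted(embedding_hits + web_hits, key=lambda p: p.score, reverse=True):
        if p.text not in seen:
            seen.add(p.text)
            merged.append(p)
    return merged[:k]


def build_prompt(query: str, quote: Dict[str, object], passages: List[Passage]) -> str:
    """Place live market fields ahead of retrieved references, then the question."""
    quote_line = ", ".join(f"{k}={v}" for k, v in quote.items())
    refs = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        f"Live market data: {quote_line}\n"
        f"References:\n{refs}\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only, not real market data.
    quote = {"ts_code": "000001.SZ", "trade_date": "20250530", "close": 11.5}
    hits = merge_passages(
        [Passage("embedding", "Annual report excerpt", 0.82)],
        [Passage("web", "Recent news snippet", 0.77)],
    )
    print(build_prompt("What was the latest closing price?", quote, hits))
```

Sorting by a shared score is the simplest merge rule; if the two retrievers score on different scales, a rank-based merge would be a safer choice.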

Problem Statement

Existing financial benchmarks focus on static reports or synthetic queries and lack real-time data and real user intent. This leaves a gap for evaluating RAG systems that must combine live market feeds and text corpora for accurate financial assistance.

Main Contribution

A user-driven dataset of 316 real queries from a production financial assistant (104 numerical, 212 content) with manual gold answers.

A workflow-aware intent taxonomy: 9 top-level categories and 62 second-level intents aligned to business pipelines.

Key Findings

Dataset composition: 316 real user queries covering both time-sensitive numbers and content questions.

Numbers: 316 queries (104 numerical, 212 content).

Practical Use: Use this small, realistic set to test both live-data integration and content grounding before scaling.

Evidence Ref: Section 2.5 and Dataset composition paragraph.

A closed-source model, Xiaofa-1.0, achieved the best content accuracy in this benchmark.

Numbers: Xiaofa-1.0 accuracy = 91.5% (content queries).

Practical Use: If you need strong content responses on similar Chinese financial tasks, prioritize models with demonstrated external-data handling like Xiaofa-1.0, noting it is closed-source.

Evidence Ref: Section 3.2, content-based queries results.

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Xiaofa-1.0: 91.5%; other models: 71%–83% (on 212 content queries) | - | - | FinS-Pilot content queries | Section 3.2 reports Xiaofa-1.0 = 91.5% and others 71%–83%. | Section 3.2
Accuracy | DeepSeek-v3 without reference: 0% on 104 numerical queries | - | - | FinS-Pilot numerical queries | Section 3.2 states DeepSeek-v3 w/o reference yields zero accuracy. | Section 3.2

What To Try In 7 Days

Hook a market API (e.g., Tushare) to your LLM pipeline and test 10 numeric queries.

Collect 2–4 weeks of real user logs and extract common intent templates into a small taxonomy.

Run an A/B test: embedding-only retriever vs adding web search (Bing) and compare a few ROUGE/accuracy metrics.
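For the A/B comparison in the last step, a small scoring helper is enough to get started. The sketch below computes a plain ROUGE-L F1 (longest common subsequence over tokens) and averages it per arm. It is a simplified stand-in, not the benchmark's evaluation code: the function names and the `tokenize` parameter are hypothetical, and the default whitespace tokenizer does not suit Chinese text, where passing `list` for character-level tokens or a word segmenter would be more appropriate.

```python
from typing import Callable, Dict, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str,
               tokenize: Callable[[str], List[str]] = str.split) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = tokenize(candidate), tokenize(reference)
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def compare_arms(references: List[str],
                 arm_a: List[str],
                 arm_b: List[str]) -> Dict[str, float]:
    """Mean ROUGE-L F1 per arm; all three lists must be aligned by query."""
    if not (len(references) == len(arm_a) == len(arm_b)) or not references:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(references)
    return {
        "embedding_only": sum(rouge_l_f1(a, ref) for a, ref in zip(arm_a, references)) / n,
        "with_web_search": sum(rouge_l_f1(b, ref) for b, ref in zip(arm_b, references)) / n,
    }
```

With only a handful of queries, treat any gap between the two arms as a hint rather than a result; an overlap score also says nothing about whether a numeric answer is actually correct, so pair it with an exact-match or tolerance check on numbers.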

Agent Features

Tool Use
external API calls (Tushare), web search (Bing), embedding retriever

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small scale: only 316 queries limits statistical power.

Single-source logs risk user-distribution bias from one assistant.
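To put a number on the small-scale limitation, a confidence interval around a reported accuracy is a quick check. The sketch below uses the standard Wilson score interval. The count of 194 correct answers is an assumption: it is what 91.5% of 212 content queries rounds to, not a figure stated in the source.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Assumed count: 194 of 212 is the integer closest to the reported 91.5%.
    low, high = wilson_interval(194, 212)
    print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under that assumption the interval spans roughly 0.87 to 0.95, so differences of a few points between models on the 212 content queries are hard to separate from sampling noise.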

When Not To Use

When you need large-scale, statistically robust benchmarks.

When evaluating multilingual or non-Chinese financial assistants.

Failure Modes

Unit conversion mistakes (e.g., percentages vs. basis points).

Temporal reference errors (using wrong timestamp or stale data).
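Both failure modes can be guarded against with cheap checks before an answer is returned. The helpers below are an illustrative sketch with hypothetical names; the only fixed fact they rely on is that one percentage point equals 100 basis points. The acceptable data age is a policy choice and is passed in rather than assumed.

```python
from datetime import datetime, timedelta

BPS_PER_PERCENT = 100  # 1 percentage point = 100 basis points


def percent_to_bps(percent: float) -> float:
    """Convert percentage points to basis points (0.25% -> 25 bps)."""
    return percent * BPS_PER_PERCENT


def bps_to_percent(bps: float) -> float:
    """Convert basis points to percentage points (50 bps -> 0.5%)."""
    return bps / BPS_PER_PERCENT


def is_stale(quote_time: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the quote is older than the allowed age at answer time."""
    return now - quote_time > max_age


if __name__ == "__main__":
    # Example timestamps chosen for illustration only.
    quote_time = datetime(2025, 5, 30, 15, 0)
    now = datetime(2025, 5, 31, 9, 30)
    print(percent_to_bps(0.25), is_stale(quote_time, now, timedelta(hours=1)))
```

Carrying the unit and the data timestamp alongside every numeric value in the prompt makes both kinds of error easier to spot in evaluation logs.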

Core Entities

Models

DeepSeek-v3, DeepSeek-R1, Doubao-1.5-pro, Moonshot-v1, Baichuan4, Xiaofa-1.0

Metrics

Accuracy, ROUGE-L, BLEU, cosine similarity, hallucination, completeness, relevance

Datasets

FinS-Pilot

Benchmarks

FinanceBench, FinQA, FiQA

Context Entities

Models

LAMBDA, MMLU

Metrics

ROUGE, BLEU

Datasets

LAMBDA, MMLU

Benchmarks

LiveBench