FinS-Pilot: a 316-query, user-driven benchmark that tests real-time financial RAG with live API data

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and focused on live-data RAG for Chinese finance; small size and single-source logs limit generality.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Feng Wang, Yiding Sun, Jiaxin Mao, Wei Xue, Danqing Xu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Financial assistants must combine live market APIs with text retrieval: without live data, numeric answers are wrong, and adding web search improves content quality.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

FinS-Pilot is an open benchmark and dataset for financial retrieval-augmented generation (RAG). It is built from 316 real user queries from an online financial assistant, split into 104 numerical (time-sensitive) and 212 content queries. The benchmark injects real-time market data via the Tushare API and uses a dual retrieval setup (embedding-based text + web search) to simulate production RAG. Experiments on six Chinese LLMs show (1) a closed-source model (Xiaofa-1.0) reached 91.5% accuracy on content questions, (2) retrieval from web search (Bing) improves generation quality, and (3) LLMs without live data fail numeric tasks (example: DeepSeek-v3 w/o reference = 0% on numerical queries).
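The live-data injection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `Quote`, `build_prompt`, `fetch_quote`, and `search_text` are hypothetical names, the ticker is a placeholder, and the two callables stand in for a market API client (such as a Tushare wrapper) and a text or web retriever that you would supply yourself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Quote:
    symbol: str
    close: float
    as_of: datetime  # timestamp of the data point, surfaced in the prompt


def build_prompt(
    query: str,
    is_numerical: bool,
    fetch_quote: Callable[[str], Quote],      # hypothetical market-API wrapper
    search_text: Callable[[str], List[str]],  # hypothetical text/web retriever
    symbol: str = "",
) -> str:
    """Assemble an LLM prompt whose references depend on the query type."""
    references: List[str] = []
    if is_numerical and symbol:
        # Numerical queries get a fresh quote with an explicit timestamp.
        q = fetch_quote(symbol)
        references.append(
            f"[market data] {q.symbol} close={q.close:.2f} as of {q.as_of.isoformat()}"
        )
    else:
        # Content queries get retrieved passages instead.
        references.extend(f"[text] {p}" for p in search_text(query))
    ref_block = "\n".join(references) if references else "(no references)"
    return (
        f"References:\n{ref_block}\n\n"
        f"Question: {query}\nAnswer using only the references."
    )


# Usage with stubbed data sources (all values are made up for illustration).
stub_quote = Quote("000001.SZ", 12.34, datetime(2025, 1, 2, 7, 0, tzinfo=timezone.utc))
prompt = build_prompt(
    "What did 000001.SZ close at?",
    is_numerical=True,
    fetch_quote=lambda s: stub_quote,
    search_text=lambda q: [],
    symbol="000001.SZ",
)
```

Passing the data sources in as callables keeps the sketch testable offline and makes it easy to swap in a real API client later.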

Problem Statement

Existing financial benchmarks focus on static reports or synthetic queries and lack real-time data and real user intent. This leaves a gap for evaluating RAG systems that must combine live market feeds and text corpora for accurate financial assistance.

Main Contribution

A user-driven dataset of 316 real queries from a production financial assistant (104 numerical, 212 content) with manual gold answers.

A workflow-aware intent taxonomy: 9 top-level categories and 62 second-level intents aligned to business pipelines.

Key Findings

Dataset composition: 316 real user queries covering both time-sensitive numbers and content questions.

Numbers: 316 queries (104 numerical, 212 content).

Practical Use: Use this small, realistic set to test both live-data integration and content grounding before scaling.

Evidence Ref: Section 2.5 and Dataset composition paragraph.

A closed-source model, Xiaofa-1.0, achieved the best content accuracy in this benchmark.

Numbers: Xiaofa-1.0 accuracy = 91.5% (content queries).

Practical Use: If you need strong content responses on similar Chinese financial tasks, prioritize models with demonstrated external-data handling like Xiaofa-1.0, noting it is closed-source.

Evidence Ref: Section 3.2, content-based queries results.

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Xiaofa-1.0: 91.5%; other models: 71%–83% (on 212 content queries)	—	—	FinS-Pilot content queries	Section 3.2 reports Xiaofa-1.0 = 91.5% and others 71%–83%.	Section 3.2
Accuracy	DeepSeek-v3 without reference: 0% on 104 numerical queries	—	—	FinS-Pilot numerical queries	Section 3.2 states DeepSeek-v3 w/o reference yields zero accuracy.	Section 3.2

What To Try In 7 Days

Hook a market API (e.g., Tushare) to your LLM pipeline and test 10 numeric queries.

Collect 2–4 weeks of real user logs and extract common intent templates into a small taxonomy.

Run an A/B test: embedding-only retriever vs adding web search (Bing) and compare a few ROUGE/accuracy metrics.
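For the A/B comparison in the last step, a small self-contained scorer may be enough to get started. The sketch below computes an LCS-based ROUGE-L F1 on character tokens, which is one common way to handle Chinese text without a word segmenter; it is an assumption that this matches your setup, and it is not necessarily the exact metric configuration used in the paper. The function names are hypothetical.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over character tokens (whitespace counts as a token)."""
    c, r = list(candidate), list(reference)
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def mean_rouge_l(candidates: List[str], references: List[str]) -> float:
    """Average ROUGE-L F1 across paired answers and gold references."""
    scores = [rouge_l_f1(c, r) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Run `mean_rouge_l` once on the embedding-only answers and once on the embedding-plus-web-search answers, against the same gold references, and compare the two averages.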

Agent Features

Tool Use

external API calls (Tushare), web search (Bing), embedding retriever

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/PhealenWang/financial_rag_benchmark

Data URLs

https://github.com/PhealenWang/financial_rag_benchmark

Risks & Boundaries

Limitations

Small scale: only 316 queries limits statistical power.

Single-source logs risk user-distribution bias from one assistant.
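To make the statistical-power concern concrete, a Wilson score interval shows how much uncertainty sits around an accuracy measured on a few hundred queries. The sketch below is illustrative only: the count of 194 correct answers is an assumption back-computed from the reported 91.5% on 212 content queries, not a figure taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 91.5% of 212 content queries is about 194 correct answers.
low, high = wilson_interval(194, 212)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 0.87 to 0.95, so differences of a few points between models on this benchmark should be read with caution.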

When Not To Use

When you need large-scale, statistically robust benchmarks.

When evaluating multilingual or non-Chinese financial assistants.

Failure Modes

Unit conversion mistakes (e.g., percentages vs. basis points).

Temporal reference errors (using wrong timestamp or stale data).
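Both failure modes can be guarded against with small, explicit checks before a number reaches the model or the user. The helpers below are a hypothetical sketch (the names and the freshness threshold are assumptions, not part of the benchmark); the only fixed fact is that one percentage point equals 100 basis points.

```python
from datetime import datetime, timedelta, timezone

BASIS_POINTS_PER_PERCENT = 100  # 1 percentage point = 100 basis points


def percent_to_bps(percent: float) -> float:
    """Convert a value in percent (e.g. 0.25 for 0.25%) to basis points."""
    return percent * BASIS_POINTS_PER_PERCENT


def bps_to_percent(bps: float) -> float:
    """Convert basis points back to percent."""
    return bps / BASIS_POINTS_PER_PERCENT


def is_stale(as_of: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when a data point is older than max_age or timestamped in the future.

    Both datetimes should be timezone-aware so the subtraction is well defined.
    """
    age = now - as_of
    return age > max_age or age < timedelta(0)


# Usage: reject a quote that is more than 15 minutes old (threshold is illustrative).
now = datetime(2025, 1, 2, 7, 30, tzinfo=timezone.utc)
quote_time = datetime(2025, 1, 2, 7, 0, tzinfo=timezone.utc)
needs_refresh = is_stale(quote_time, now, max_age=timedelta(minutes=15))
```

Keeping the unit in the function name makes a percent-versus-basis-point mix-up visible at the call site rather than in the final answer.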

Core Entities

Models

DeepSeek-v3, DeepSeek-R1, Doubao-1.5-pro, Moonshot-v1, Baichuan4, Xiaofa-1.0

Metrics

Accuracy, ROUGE-L, BLEU, cosine similarity, hallucination, completeness, relevance

Datasets

FinS-Pilot

Benchmarks

FinanceBench, FinQA, FiQA

Context Entities

Models

LAMBDA, MMLU

Metrics

ROUGE, BLEU

Datasets

LAMBDA, MMLU

Benchmarks

LiveBench

