Overview
The benchmark is practical and focused on live-data RAG for Chinese finance; small size and single-source logs limit generality.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Financial assistants must combine live market APIs with text retrieval: without live data, numeric answers are wrong, and adding web search improves content quality.
Who Should Care
Summary TLDR
FinS-Pilot is an open benchmark and dataset for financial retrieval-augmented generation (RAG). It is built from 316 real user queries from an online financial assistant, split into 104 numerical (time-sensitive) and 212 content queries. The benchmark injects real-time market data via the Tushare API and uses a dual retrieval setup (embedding-based text + web search) to simulate production RAG. Experiments on six Chinese LLMs show (1) a closed-source model (Xiaofa-1.0) reached 91.5% accuracy on content questions, (2) retrieval from web search (Bing) improves generation quality, and (3) LLMs without live data fail numeric tasks (example: DeepSeek-v3 w/o reference = 0% on numerical queries).
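The dual retrieval setup described above can be sketched as a merge of two retrievers. This is a minimal illustration, not the benchmark's implementation: the `Retriever` signature, `dual_retrieve`, and the stub retrievers are hypothetical names, and the case-insensitive de-duplication is an assumption about how overlapping passages might be handled.

```python
from typing import Callable, List

# Hypothetical retriever signature: (query, k) -> up to k passages.
Retriever = Callable[[str, int], List[str]]


def dual_retrieve(query: str, embed_search: Retriever,
                  web_search: Retriever, k: int = 3) -> List[str]:
    """Merge embedding-based and web-search passages, dropping duplicates."""
    seen = set()
    merged: List[str] = []
    # Embedding results come first, then web results; order is an assumption.
    for passage in embed_search(query, k) + web_search(query, k):
        key = passage.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(passage)
    return merged


# Stub retrievers with made-up passages so the sketch runs offline.
def stub_embed(query: str, k: int) -> List[str]:
    return ["Fund A invests in bonds.", "Fund B tracks an index."][:k]


def stub_web(query: str, k: int) -> List[str]:
    return ["fund a invests in bonds.", "Fund C launched this year."][:k]


passages = dual_retrieve("What does Fund A invest in?", stub_embed, stub_web)
```

In a real pipeline the stubs would be replaced by a vector-store query and a web search client, and the merged passages would be passed to the generator as references.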
Problem Statement
Existing financial benchmarks focus on static reports or synthetic queries and lack real-time data and real user intent. This leaves a gap for evaluating RAG systems that must combine live market feeds and text corpora for accurate financial assistance.
Main Contribution
A user-driven dataset of 316 real queries from a production financial assistant (104 numerical, 212 content) with manual gold answers.
A workflow-aware intent taxonomy: 9 top-level categories and 62 second-level intents aligned to business pipelines.
Key Findings
Dataset composition: 316 real user queries covering both time-sensitive numbers and content questions.
A closed-source model, Xiaofa-1.0, achieved the best content accuracy in this benchmark.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | Xiaofa-1.0: 91.5%; other models: 71%–83% (on 212 content queries) | — | — | FinS-Pilot content queries | Section 3.2 reports Xiaofa-1.0 = 91.5% and others 71%–83%. | Section 3.2 |
| Accuracy | DeepSeek-v3 without reference: 0% on 104 numerical queries | — | — | FinS-Pilot numerical queries | Section 3.2 states DeepSeek-v3 w/o reference yields zero accuracy. | Section 3.2 |
What To Try In 7 Days
Hook a market API (e.g., Tushare) to your LLM pipeline and test 10 numeric queries.
Collect 2–4 weeks of real user logs and extract common intent templates into a small taxonomy.
Run an A/B test: embedding-only retrieval versus embedding plus web search (Bing), and compare ROUGE and accuracy.
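The first item above can be prototyped offline before wiring in a real market API client. The sketch below shows one way to inject a live quote into the prompt; `Quote`, `QuoteFetcher`, `build_prompt`, and `fake_fetch` are hypothetical names, the quote values are made up, and no Tushare call is made here (a real fetcher would wrap whichever client you use).

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass
class Quote:
    ts_code: str      # instrument code, e.g. "000001.SZ"
    trade_date: date  # date the quote refers to
    close: float      # closing price


# Hypothetical fetcher signature; stubbed below so the sketch runs offline.
QuoteFetcher = Callable[[str], Optional[Quote]]


def build_prompt(question: str, ts_code: str, fetch: QuoteFetcher) -> str:
    """Prepend live reference data to the user question."""
    quote = fetch(ts_code)
    if quote is None:
        reference = "No live market data available."
    else:
        reference = (f"{quote.ts_code} closed at {quote.close:.2f} "
                     f"on {quote.trade_date.isoformat()}.")
    return (f"Reference data:\n{reference}\n\n"
            f"Question: {question}\n"
            "Use only the reference data for numeric values.")


def fake_fetch(ts_code: str) -> Optional[Quote]:
    # Made-up quote for illustration only.
    return Quote(ts_code=ts_code, trade_date=date(2024, 1, 2), close=12.34)


prompt = build_prompt("What did 000001.SZ close at?", "000001.SZ", fake_fetch)
```

Keeping the fetcher injectable makes it easy to run the same 10 numeric queries with and without live data and compare the answers.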
Agent Features
Tool Use
Risks & Boundaries
Limitations
Small scale: the benchmark has only 316 queries, which limits statistical power.
Single source: logs from one assistant risk user-distribution bias.
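To make the small-scale limitation concrete, the sketch below computes a 95% Wilson score interval for an accuracy measured on a split of this size. The count of 194 correct answers is an assumption inferred from 91.5% of 212 content queries; it is not reported in the source. `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Assumed count: 194 of 212 content queries correct (about 91.5%).
content_low, content_high = wilson_interval(194, 212)

# 0 of 104 numerical queries correct (DeepSeek-v3 without reference).
numeric_low, numeric_high = wilson_interval(0, 104)
```

Under these assumptions the content-accuracy interval spans roughly 87% to 95%, so differences of a few points between models on 212 queries should be read with caution.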
When Not To Use
When you need large-scale, statistically robust benchmarks.
When evaluating multilingual or non-Chinese financial assistants.
Failure Modes
Unit conversion mistakes (e.g., percentages vs. basis points).
Temporal reference errors (using wrong timestamp or stale data).
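Both failure modes above lend themselves to simple guards before an answer is returned. The sketch below is illustrative only: the function names are hypothetical and the 15-minute freshness threshold is an assumption, not a value from the benchmark. The conversion itself relies on the standard definition that one percentage point equals 100 basis points.

```python
from datetime import datetime, timedelta, timezone

BPS_PER_PERCENT = 100  # 1 percentage point = 100 basis points


def percent_to_bps(percent: float) -> float:
    """Convert a value in percentage points to basis points."""
    return percent * BPS_PER_PERCENT


def bps_to_percent(bps: float) -> float:
    """Convert a value in basis points to percentage points."""
    return bps / BPS_PER_PERCENT


def is_stale(quote_time: datetime, now: datetime,
             max_age: timedelta = timedelta(minutes=15)) -> bool:
    """Return True if the quote is older than max_age (assumed threshold)."""
    return now - quote_time > max_age


now = datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc)
old_quote = now - timedelta(hours=1)
fresh_quote = now - timedelta(minutes=5)
```

Using timezone-aware timestamps throughout avoids comparing a local-time quote against a UTC clock, which is one common source of temporal reference errors.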

