Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Financial assistants must combine live market APIs with text retrieval. Without live data, numeric answers are wrong; adding web search improves content quality.
Summary TLDR
FinS-Pilot is an open benchmark and dataset for financial retrieval-augmented generation (RAG). It is built from 316 real user queries from an online financial assistant, split into 104 numerical (time-sensitive) queries and 212 content queries. The benchmark injects real-time market data via the Tushare API and uses a dual retrieval setup (embedding-based text retrieval plus web search) to simulate production RAG. Experiments on six Chinese LLMs show that (1) a closed-source model (Xiaofa-1.0) reached 91.5% accuracy on content questions, (2) retrieval from web search (Bing) improves generation quality, and (3) LLMs without live data fail numeric tasks (for example, DeepSeek-v3 without reference scored 0% on numerical queries).
Problem Statement
Existing financial benchmarks focus on static reports or synthetic queries and lack real-time data and real user intent. This leaves a gap for evaluating RAG systems that must combine live market feeds and text corpora for accurate financial assistance.
Main Contribution
A user-driven dataset of 316 real queries from a production financial assistant (104 numerical, 212 content) with manual gold answers.
A workflow-aware intent taxonomy: 9 top-level categories and 62 second-level intents aligned to business pipelines.
A hybrid retrieval design that integrates Tushare API (real-time numeric data) with embedding-based text retrieval and Bing web search.
An evaluation pipeline and automated code with comparisons across six Chinese LLMs; dataset and code published on GitHub.
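The hybrid retrieval idea above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the keyword-based `is_numeric_query` heuristic, the `Evidence` type, and the three injected callables (`market_api`, `embedding_search`, `web_search`) are all hypothetical stand-ins for a market data client such as Tushare, a vector index, and a search API such as Bing. The routing rule itself (numeric queries go to the market API, content queries go to both text retrievers) is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    source: str  # "market_api", "embedding", or "web"
    text: str


# Hypothetical keyword heuristic; a production system would more likely
# use an intent classifier built on the benchmark's intent taxonomy.
NUMERIC_HINTS = ("price", "close", "volume", "pe ratio", "market cap", "yield")


def is_numeric_query(query: str) -> bool:
    """Guess whether a query asks for a time-sensitive number."""
    q = query.lower()
    return any(hint in q for hint in NUMERIC_HINTS)


def retrieve(
    query: str,
    market_api: Callable[[str], str],
    embedding_search: Callable[[str], List[str]],
    web_search: Callable[[str], List[str]],
) -> List[Evidence]:
    """Route a query to the live-data path or the text-retrieval path."""
    if is_numeric_query(query):
        # Numeric path: a single live quote from the market data source.
        return [Evidence("market_api", market_api(query))]
    # Content path: embedding hits first, then web search results.
    evidence = [Evidence("embedding", t) for t in embedding_search(query)]
    evidence += [Evidence("web", t) for t in web_search(query)]
    return evidence
```

Passing the retrievers in as callables keeps the routing logic testable with stubs, so the live API and search backends can be swapped without touching it.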
Key Findings
Dataset composition: 316 real user queries covering both time-sensitive numbers and content questions.
A closed-source model, Xiaofa-1.0, achieved the best content accuracy in this benchmark (91.5% on content questions).
Web search retrieval (Bing) improved generation quality vs. embedding-only retrieval.
LLMs without live references fail numeric tasks: DeepSeek-v3 without reference scored 0% on numerical queries.
Common failure modes: unit conversion errors and wrong time references.
Results
Accuracy
ROUGE-L (retrieval impact example)
Who Should Care
What To Try In 7 Days
Hook a market API (e.g., Tushare) to your LLM pipeline and test 10 numeric queries.
Collect 2–4 weeks of real user logs and extract common intent templates into a small taxonomy.
Run an A/B test: compare an embedding-only retriever against one that also adds web search (Bing), using ROUGE-L and accuracy on the same query set.
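For the A/B comparison in the last step, a ROUGE-L score can be computed without extra dependencies. The sketch below is a simplified version under stated assumptions: it tokenizes on whitespace (Chinese text would need character-level or word-segmented tokens first) and reports a plain F1 over the longest common subsequence instead of the weighted F-measure some ROUGE implementations use. The function names are illustrative and are not taken from the benchmark's code.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def mean_rouge_l(candidates: List[str], references: List[str]) -> float:
    """Average ROUGE-L F1 over paired answers and gold references."""
    scores = [rouge_l_f1(c, r) for c, r in zip(candidates, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `mean_rouge_l` once on the answers generated with embedding-only retrieval and once on the answers generated with web search added, against the same gold answers, gives the two numbers to compare.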
Agent Features
Tool Use
- external API calls (Tushare)
- web search (Bing)
- embedding retriever
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small scale: only 316 queries limits statistical power.
- Single-source logs risk user-distribution bias from one assistant.
- Language and region: experiments and data are Chinese-focused.
- Some top-performing model(s) are closed-source, limiting reproducibility.
- Manual labeling is expensive and may not scale to larger testbeds.
When Not To Use
- When you need large-scale, statistically robust benchmarks.
- When evaluating multilingual or non-Chinese financial assistants.
- When you require fully open-model-only comparisons.
Failure Modes
- Unit conversion mistakes (e.g., percentages vs. basis points).
- Temporal reference errors (using wrong timestamp or stale data).
- Hallucinated facts when retrieval fails or is irrelevant.
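The first two failure modes lend themselves to simple automated guards. The following is a minimal sketch, assuming answers and gold values are already parsed into floats and timestamps; the 15-minute freshness window and the 0.1% relative tolerance are hypothetical defaults, not values from the benchmark.

```python
import math
from datetime import datetime, timedelta


def percent_to_basis_points(percent: float) -> float:
    """Convert percent to basis points (1 percent = 100 basis points)."""
    return percent * 100.0


def is_stale(
    quote_time: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),  # hypothetical freshness window
) -> bool:
    """Flag a quote whose timestamp is older than the allowed window."""
    return now - quote_time > max_age


def numeric_match(predicted: float, gold: float, rel_tol: float = 1e-3) -> bool:
    """Compare a predicted number to the gold value within a relative tolerance."""
    return math.isclose(predicted, gold, rel_tol=rel_tol)
```

For example, a gold answer of 25 basis points does not match a model answer of 0.25 given in percent, but it does match once the answer is converted with `percent_to_basis_points`, which makes unit mix-ups visible as a separate error class from plainly wrong numbers.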
Core Entities
Models
- DeepSeek-v3
- DeepSeek-R1
- Doubao-1.5-pro
- Moonshot-v1
- Baichuan4
- Xiaofa-1.0
Metrics
- Accuracy
- ROUGE-L
- BLEU
- cosine similarity
- hallucination
- completeness
- relevance
Datasets
- FinS-Pilot
Benchmarks
- FinanceBench
- FinQA
- FiQA
Context Entities
Models
- LAMBDA
- MMLU
Metrics
- ROUGE
- BLEU
Datasets
- LAMBDA
- MMLU
Benchmarks
- LiveBench

