Overview
Production Readiness
0.35
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Fin-RATE shows that retrieval and entity/year alignment, not model size, are the main obstacles to reliable LLM support for multi-document financial analysis. Production systems should therefore prioritize evidence routing and retriever design.
Summary TLDR
Fin-RATE is a new benchmark built from 2,472 SEC filings (15,311 text chunks) to test large language models on three analyst-style tasks: Detail & Reasoning (single-chunk), Enterprise Comparison (multi-company), and Longitudinal Tracking (multi-year). The authors evaluate 17 LLMs under gold-context and retrieval-augmented (RAG) setups. Key takeaways: models can reason when relevant evidence is provided, but end-to-end performance collapses under realistic retrieval because retrievers miss or mis-rank crucial company- and year-level evidence. A hierarchical, company/year-aware retriever closes much of that gap.
Problem Statement
Existing financial QA benchmarks treat filings as isolated facts and measure only answer correctness. Real analyst work requires integrating multiple documents, aligning entities and years, and explaining comparisons. Current benchmarks and evaluation protocols do not diagnose whether failures come from retrieval, temporal/entity misalignment, generation hallucination, or poor reasoning.
Main Contribution
Fin-RATE: a 7,500-question benchmark built from 15,311 SEC filing chunks, covering 43 companies (2020–2025) and three tasks: DR-QA (Detail & Reasoning), EC-QA (Enterprise Comparison), and LT-QA (Longitudinal Tracking).
A 13-type error taxonomy and Likert scoring on five fine-grained dimensions to diagnose failures beyond binary correctness.
Large-scale evaluation of 17 LLMs under gold-context and RAG, and demonstration that retrieval—more than generation—drives end-to-end failures.
A hierarchical, entity/year-aware retrieval pipeline that substantially improves evidence coverage and ranking over standard hybrid retrievers.
Key Findings
Accuracy falls sharply when moving from single-chunk to cross-year and cross-company tasks.
End-to-end RAG performance is much worse than gold-context performance; retrievers are the dominant bottleneck.
EC-QA suffers a high rate of Missing Evidence failures under RAG.
Dense vector (finance-tuned) retrievers outperform BM25 on cross-company semantic matches but lose temporal precision.
A hierarchical, bucketed retrieval strategy gives large gains in entity/year coverage and ranking.
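The bucketed retrieval idea in the last finding can be illustrated with a minimal sketch. This is not the authors' pipeline: the chunk schema (`company`, `year`, `text`), the function names, and the toy token-overlap scorer are all assumptions made for illustration, and a real system would likely substitute BM25 or an embedding model for `overlap_score`.

```python
from collections import defaultdict


def build_buckets(chunks):
    """Group chunks by (company, fiscal year) so each bucket can be searched separately."""
    buckets = defaultdict(list)
    for chunk in chunks:
        buckets[(chunk["company"], chunk["year"])].append(chunk)
    return buckets


def overlap_score(query, text):
    """Toy relevance score (assumption): count of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def bucketed_retrieve(query, buckets, companies, years, per_bucket=2):
    """Take the top chunks from every requested (company, year) bucket.

    Searching each bucket on its own guarantees that every requested
    entity/year pair with at least one chunk is represented in the results.
    """
    results = []
    for company in companies:
        for year in years:
            candidates = buckets.get((company, year), [])
            ranked = sorted(
                candidates,
                key=lambda c: overlap_score(query, c["text"]),
                reverse=True,
            )
            results.extend(ranked[:per_bucket])
    return results


# Hypothetical chunks; company names and text are invented for the example.
chunks = [
    {"id": "a1", "company": "AAA", "year": 2023, "text": "total revenue increased in fiscal 2023"},
    {"id": "a2", "company": "AAA", "year": 2023, "text": "risk factors include supply chain"},
    {"id": "b1", "company": "BBB", "year": 2023, "text": "revenue declined due to lower demand"},
    {"id": "b2", "company": "BBB", "year": 2022, "text": "revenue grew in fiscal 2022"},
]
buckets = build_buckets(chunks)
hits = bucketed_retrieve("how did revenue change", buckets, ["AAA", "BBB"], [2023], per_bucket=1)
hit_ids = [h["id"] for h in hits]
print(hit_ids)
```

The design choice worth noting is that ranking happens inside each bucket rather than over the whole corpus, so one company's filings cannot crowd another company's out of the top results.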
Results
Accuracy
Retriever recall (BM25 R@10 on DR-QA)
Retriever recall (finance-vector VF on EC-QA)
Retrieval Missing Evidence (EC-QA)
Hierarchical retrieval improvements (EC-QA)
Who Should Care
What To Try In 7 Days
Run a smoke test: measure your retriever's Missing Evidence rate on representative cross-company and cross-year queries.
Bucket documents by company and fiscal year and add simple company/year routing before retrieval.
Combine a finance-specific embedding model with BM25 as separate signals, then check top-k overlap to detect hybrid noise.
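The first and third steps above can be approximated with two small helpers. This is a rough sketch under stated assumptions: "Missing Evidence" is approximated here as "at least one required (company, year) pair has no chunk in the top-k", which may differ from the paper's exact definition, and the query and result schemas are hypothetical.

```python
def missing_evidence_rate(queries, retrieved, k=10):
    """Fraction of queries whose top-k results miss a required (company, year) pair.

    queries:   list of {"id": str, "required": [(company, year), ...]}
    retrieved: dict mapping query id -> ranked list of chunk dicts
    """
    if not queries:
        return 0.0
    missing = 0
    for query in queries:
        top_k = retrieved[query["id"]][:k]
        covered = {(chunk["company"], chunk["year"]) for chunk in top_k}
        if not set(query["required"]) <= covered:
            missing += 1
    return missing / len(queries)


def topk_overlap(ids_a, ids_b, k=10):
    """Jaccard overlap between the top-k chunk ids of two retrievers."""
    a, b = set(ids_a[:k]), set(ids_b[:k])
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical smoke-test data.
queries = [
    {"id": "q1", "required": [("AAA", 2023), ("BBB", 2023)]},
    {"id": "q2", "required": [("AAA", 2022), ("AAA", 2023)]},
]
retrieved = {
    "q1": [{"company": "AAA", "year": 2023}, {"company": "AAA", "year": 2022}],
    "q2": [{"company": "AAA", "year": 2022}, {"company": "AAA", "year": 2023}],
}
rate = missing_evidence_rate(queries, retrieved, k=10)
overlap = topk_overlap(["c1", "c2", "c3"], ["c2", "c3", "c4"], k=3)
print(rate, overlap)
```

A very low overlap between the BM25 and embedding top-k lists suggests the two signals disagree on what is relevant, which is one place hybrid noise can enter when their results are merged.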
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The dataset samples 43 companies and 34 in the Appendix; it is not exhaustive across market capitalization or sectors.
- Gold-context evaluations overstate real-world performance because retriever access is idealized.
- LLM-as-Judge fusion depends on judge models and tuned weights that may bias labels toward those judges' failure modes.
When Not To Use
- Do not use Fin-RATE to benchmark short conversational or synthetic financial QA that does not require cross-document integration.
- Avoid using gold-context results to claim end-to-end system readiness without retriever evaluation.
Failure Modes
- Missing Evidence: retriever fails to return crucial company/year chunks.
- Comparative Hallucination: model fabricates cross-company comparisons when evidence is asymmetric.
- Time Mismatch: model references the wrong fiscal year or treats multi-year information as independent rather than as a connected series.
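Two of these failure modes lend themselves to cheap pre- and post-generation checks. The sketch below is a heuristic assumed for illustration, not something described in the paper: it flags years mentioned in an answer that no evidence chunk supports, and lists companies in a comparison that have no evidence at all. The regular expression will also match four-digit numbers that are not years, so treat hits as warnings only.

```python
import re
from collections import Counter

# Matches four-digit numbers starting with 19 or 20 (heuristic for calendar years).
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def unsupported_years(answer, evidence_years):
    """Years mentioned in the answer that do not appear among the evidence years."""
    mentioned = {int(year) for year in YEAR_PATTERN.findall(answer)}
    return sorted(mentioned - set(evidence_years))


def uncovered_companies(required_companies, evidence_chunks):
    """Companies in the comparison that have zero retrieved evidence chunks."""
    counts = Counter(chunk["company"] for chunk in evidence_chunks)
    return [name for name in required_companies if counts[name] == 0]


# Hypothetical answer and evidence.
bad_years = unsupported_years("Revenue rose in 2021 versus 2023.", [2022, 2023])
no_evidence = uncovered_companies(["AAA", "BBB"], [{"company": "AAA"}])
print(bad_years, no_evidence)
```

A non-empty `no_evidence` list before generation signals asymmetric evidence, the condition under which comparative hallucination is described above.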
Core Entities
Models
- GPT-5-websearch
- GPT-4.1
- GPT-4.1-websearch
- DeepSeek-V3
- DeepSeek-V3.2
- DeepSeek-R1
- Qwen3-8B
- Qwen3-14B
- Qwen3-30B
- Qwen3-235B
- Llama-3.3-70B-Instruct
- GPT-OSS-20B
- MIMO-V2-Flash
- Fin-R1
- Fino1-14B
- FinanceConnect-13B
- TouchstoneGPT-7B-Instruct
Metrics
- Accuracy
- Recall@K
- Precision@K
- MRR
- Likert scores (Information Coverage, Reasoning, Factual Consistency, Clarity, Depth)
- Retrieval error types (Missing Evidence, Sorting Failure, Distractor Evidence)
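For reference, the three ranking metrics listed above can be computed as follows. These are the standard textbook definitions, not code from the paper; chunk ids and the example ranking are invented.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Share of gold chunks that appear in the top-k results."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(ranked_ids[:k]) & gold) / len(gold)


def precision_at_k(ranked_ids, gold_ids, k):
    """Share of the top-k slots occupied by gold chunks."""
    if k <= 0:
        return 0.0
    return len(set(ranked_ids[:k]) & set(gold_ids)) / k


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold chunk, or 0.0 if none is retrieved."""
    gold = set(gold_ids)
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, gold_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


# Hypothetical ranking: gold chunks sit at positions 2 and 4.
ranked = ["c7", "c2", "c9", "c4"]
gold = ["c2", "c4"]
r3 = recall_at_k(ranked, gold, 3)
p3 = precision_at_k(ranked, gold, 3)
mrr = mean_reciprocal_rank([(ranked, gold), (["c1"], ["c1"])])
print(r3, p3, mrr)
```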
Datasets
- Fin-RATE (this paper)
- SEC EDGAR filings (source corpus)
- finance-embeddings-investopedia (used for dense retrieval)
Benchmarks
- FinQA
- DocFinQA
- SEC-QA
- FinanceBench
- PACIFIC

