Fin-RATE: a realistic SEC-filings benchmark that stresses cross-document, cross-year and cross-company financial reasoning

February 7, 2026 · 8 min

Overview

Production Readiness

0.35

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

Links

Abstract / PDF

Why It Matters For Business

Fin-RATE shows that retrieval and entity/year alignment—not model size—are the main obstacles to reliable LLM support for multi-document financial analysis, so production systems must prioritize evidence routing and retriever design.

Summary TLDR

Fin-RATE is a new benchmark built from 2,472 SEC filings (15,311 text chunks) to test large language models on three analyst-style tasks: Detail & Reasoning (single-chunk), Enterprise Comparison (multi-company), and Longitudinal Tracking (multi-year). The authors evaluate 17 LLMs under gold-context and retrieval-augmented (RAG) setups. Key takeaways: models can reason when relevant evidence is provided, but end-to-end performance collapses under realistic retrieval because retrievers miss or mis-rank crucial company- and year-level evidence. A hierarchical, company/year-aware retriever closes much of that gap.

Problem Statement

Existing financial QA benchmarks treat filings as isolated facts and measure only answer correctness. Real analyst work requires integrating multiple documents, aligning entities and years, and explaining comparisons. Current benchmarks and evaluation protocols do not diagnose whether failures come from retrieval, temporal/entity misalignment, generation hallucination, or poor reasoning.

Main Contribution

Fin-RATE: a 7,500-question benchmark from 15,311 SEC filing chunks covering 43 companies (2020–2025) and three tasks: DR-QA, EC-QA, LT-QA.

A 13-type error taxonomy and Likert scoring on five fine-grained dimensions to diagnose failures beyond binary correctness.

Large-scale evaluation of 17 LLMs under gold-context and RAG, and demonstration that retrieval—more than generation—drives end-to-end failures.

A hierarchical, entity/year-aware retrieval pipeline that substantially improves evidence coverage and ranking over standard hybrid retrievers.

Key Findings

Accuracy falls sharply when moving from single-chunk to cross-year and cross-company tasks.

Numbers: Accuracy drop of 18.60% (to LT-QA) and 14.35% (to EC-QA)

End-to-end RAG performance is much worse than gold-context performance; retrievers are the dominant bottleneck.

Numbers: End-to-end accuracy ≤27% vs gold-context up to 57.48% (DR-QA)

EC-QA suffers massive 'missing evidence' failures under RAG.

Numbers: Missing Evidence = 75.44% on EC-QA test subset

Dense vector (finance-tuned) retrievers outperform BM25 on cross-company semantic matches but lose temporal precision.

Numbers: EC-QA: VF recall 23.8% vs BM25 10.4%

A hierarchical, bucketed retrieval strategy gives large gains in entity/year coverage and ranking.

Numbers: EC-QA entity hit rate 13.2% → 52.9%; LT-QA year coverage 94.6% → 98.7%; P@1 and R@10 improvements reported
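
The bucketed strategy in the last finding can be illustrated with a minimal sketch. This is not the paper's pipeline: the `Chunk` fields, the `build_buckets` and `bucketed_retrieve` helpers, the toy token-overlap scorer, and the sample companies are all hypothetical stand-ins. The sketch also assumes the target companies and years have already been extracted from the query. The idea it shows is that retrieving a fixed quota from each (company, year) bucket guarantees every requested entity and year is represented, instead of letting one company dominate a global top-k.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    company: str
    year: int
    text: str


def build_buckets(chunks):
    """Group chunks by (company, fiscal year)."""
    buckets = defaultdict(list)
    for chunk in chunks:
        buckets[(chunk.company, chunk.year)].append(chunk)
    return buckets


def overlap_score(query, text):
    """Toy relevance score: count of shared lowercase tokens.
    Stand-in for BM25 or a dense embedding model (assumption)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def bucketed_retrieve(query, companies, years, buckets, per_bucket=2):
    """Take the top `per_bucket` chunks from every requested (company, year) bucket."""
    results = []
    for company in companies:
        for year in years:
            candidates = buckets.get((company, year), [])
            ranked = sorted(
                candidates,
                key=lambda c: overlap_score(query, c.text),
                reverse=True,
            )
            results.extend(ranked[:per_bucket])
    return results


# Hypothetical corpus, for illustration only.
chunks = [
    Chunk("AcmeCo", 2023, "revenue grew on cloud demand"),
    Chunk("AcmeCo", 2023, "board of directors changes"),
    Chunk("BetaInc", 2023, "revenue declined due to pricing"),
    Chunk("BetaInc", 2022, "revenue was flat"),
]
buckets = build_buckets(chunks)
hits = bucketed_retrieve(
    "revenue growth drivers", ["AcmeCo", "BetaInc"], [2023], buckets, per_bucket=1
)
covered = {(c.company, c.year) for c in hits}
```

Here `covered` contains both requested (company, year) pairs, which is the kind of entity coverage the hit-rate numbers above measure. In a real system the per-bucket scorer would be replaced by whatever retriever is already in use.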

Results

Accuracy

Value: 57.48% (best model under gold context on DR-QA)

Accuracy

Value: ≈43–44% (with gold context)

Accuracy

Value: ≤27% (across tasks with retrieved context)

Baseline: Gold-context accuracy up to 57.48%

Retriever recall (BM25 R@10 on DR-QA)

Value: 41.16% R@10

Retriever recall (finance-vector VF on EC-QA)

Value: 23.8% R@10 (VF) vs BM25 10.4% R@10

Baseline: BM25

Retrieval Missing Evidence (EC-QA)

Value: Missing Evidence = 75.44%

Hierarchical retrieval improvements (EC-QA)

Value: Entity hit rate 13.2% → 52.9%; R@10 +12.76 pts; MRR +16.64 pts

Baseline: BM25+Reranker

Who Should Care

What To Try In 7 Days

Run a smoke test: measure your retriever's Missing Evidence rate on representative cross-company and cross-year queries.

Bucket documents by company and fiscal year and add simple company/year routing before retrieval.

Combine a finance-specific embedding model with BM25 as separate signals, then check top-k overlap to detect hybrid noise.
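
The smoke test and the overlap check above can be sketched in a few lines. The definitions here are assumptions made for illustration, not the paper's: a query counts as "missing evidence" when at least one required (company, year) pair has no retrieved chunk, and hybrid overlap is measured as Jaccard similarity between the top-k ids of two retrievers. The record layout (`required`, `retrieved`) and both function names are hypothetical.

```python
def missing_evidence_rate(queries):
    """Fraction of queries where at least one required (company, year) pair
    is absent from the retrieved chunks (proxy definition, an assumption)."""
    if not queries:
        return 0.0
    missing = 0
    for q in queries:
        retrieved_pairs = {(c["company"], c["year"]) for c in q["retrieved"]}
        if not set(q["required"]) <= retrieved_pairs:
            missing += 1
    return missing / len(queries)


def topk_overlap(ids_a, ids_b, k=10):
    """Jaccard overlap between the top-k chunk ids of two retrievers."""
    a, b = set(ids_a[:k]), set(ids_b[:k])
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical evaluation records, for illustration only.
queries = [
    {   # cross-company query: BetaInc 2023 was never retrieved
        "required": [("AcmeCo", 2023), ("BetaInc", 2023)],
        "retrieved": [{"company": "AcmeCo", "year": 2023}],
    },
    {   # single-company query: required year is covered
        "required": [("AcmeCo", 2022)],
        "retrieved": [
            {"company": "AcmeCo", "year": 2022},
            {"company": "AcmeCo", "year": 2021},
        ],
    },
]
rate = missing_evidence_rate(queries)
overlap = topk_overlap(["a", "b", "c"], ["b", "c", "d"], k=3)
```

A very low overlap between BM25 and the embedding retriever suggests the two signals disagree on what is relevant, which is worth inspecting before fusing them into one ranked list.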

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The dataset samples 43 companies (and 34 in the Appendix), so it is not exhaustive across market capitalizations or sectors.
  • Gold-context evaluations overstate real-world performance because retriever access is idealized.
  • LLM-as-Judge fusion depends on judge models and tuned weights that may bias labels toward those judges' failure modes.

When Not To Use

  • Do not use Fin-RATE to benchmark short conversational or synthetic financial QA that does not require cross-document integration.
  • Avoid using gold-context results to claim end-to-end system readiness without retriever evaluation.

Failure Modes

  • Missing Evidence: retriever fails to return crucial company/year chunks.
  • Comparative Hallucination: model fabricates cross-company comparisons when evidence is asymmetric.
  • Time Mismatch: model references wrong fiscal year or treats multi-year info as independent.

Core Entities

Models

  • GPT-5-websearch
  • GPT-4.1
  • GPT-4.1-websearch
  • DeepSeek-V3
  • DeepSeek-V3.2
  • DeepSeek-R1
  • Qwen3-8B
  • Qwen3-14B
  • Qwen3-30B
  • Qwen3-235B
  • Llama-3.3-70B-Instruct
  • GPT-OSS-20B
  • MIMO-V2-Flash
  • Fin-R1
  • Fino1-14B
  • FinanceConnect-13B
  • TouchstoneGPT-7B-Instruct

Metrics

  • Accuracy
  • Recall@K
  • Precision@K
  • MRR
  • Likert scores (Information Coverage, Reasoning, Factual Consistency, Clarity, Depth)
  • Retrieval error types (Missing Evidence, Sorting Failure, Distractor Evidence)
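
For reference, the retrieval metrics listed above can be computed as follows. This sketch uses the standard textbook definitions of Recall@K, Precision@K, and MRR. The paper may apply variations (for example in how multiple gold chunks are handled), and the function names and chunk ids are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Share of gold chunks that appear in the top-k results."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(ranked_ids[:k]) & gold) / len(gold)


def precision_at_k(ranked_ids, gold_ids, k):
    """Share of the top-k slots occupied by gold chunks."""
    if k <= 0:
        return 0.0
    return len(set(ranked_ids[:k]) & set(gold_ids)) / k


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold chunk, or 0.0 if none is retrieved."""
    gold = set(gold_ids)
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, gold_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


# Hypothetical ranking: the first gold chunk appears at rank 2.
ranked = ["c3", "c1", "c7", "c2"]
gold = ["c1", "c2"]
r_at_2 = recall_at_k(ranked, gold, k=2)
p_at_2 = precision_at_k(ranked, gold, k=2)
mrr = mean_reciprocal_rank([(ranked, gold), (["c9"], ["c1"])])
```

Recall@K and MRR answer different questions: the first measures whether the evidence was retrieved at all, the second whether it was ranked near the top, which matches the distinction between the Missing Evidence and Sorting Failure error types.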

Datasets

  • Fin-RATE (this paper)
  • SEC EDGAR filings (source corpus)
  • finance-embeddings-investopedia (used for dense retrieval)

Benchmarks

  • FinQA
  • DocFinQA
  • SEC-QA
  • FinanceBench
  • PACIFIC