Fin-RATE: a realistic SEC-filings benchmark that stresses cross-document, cross-year and cross-company financial reasoning

February 7, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and novel for finance use cases; it reveals retrieval and entity/year alignment as the primary failure points and offers an effective hierarchical retrieval fix.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 35%

Novelty: 60%

Authors

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

Why It Matters For Business

Fin-RATE shows that retrieval and entity/year alignment, not model size, are the main obstacles to reliable LLM support for multi-document financial analysis. Production systems should therefore prioritize evidence routing and retriever design.

Summary TLDR

Fin-RATE is a new benchmark built from 2,472 SEC filings (15,311 text chunks) to test large language models on three analyst-style tasks: Detail & Reasoning (single-chunk), Enterprise Comparison (multi-company), and Longitudinal Tracking (multi-year). The authors evaluate 17 LLMs under gold-context and retrieval-augmented (RAG) setups. Key takeaways: models can reason when relevant evidence is provided, but end-to-end performance collapses under realistic retrieval because retrievers miss or mis-rank crucial company- and year-level evidence. A hierarchical, company/year-aware retriever closes much of that gap.

Problem Statement

Existing financial QA benchmarks treat filings as isolated facts and measure only answer correctness. Real analyst work requires integrating multiple documents, aligning entities and years, and explaining comparisons. Current benchmarks and evaluation protocols do not diagnose whether failures come from retrieval, temporal/entity misalignment, generation hallucination, or poor reasoning.

Main Contribution

Fin-RATE: a 7,500-question benchmark built from 15,311 SEC filing chunks, covering 43 companies (2020–2025) and three tasks: DR-QA (Detail & Reasoning), EC-QA (Enterprise Comparison), and LT-QA (Longitudinal Tracking).

A 13-type error taxonomy and Likert scoring on five fine-grained dimensions to diagnose failures beyond binary correctness.
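
A minimal sketch of how judgments of this kind could be aggregated, offered as an illustration rather than the authors' evaluation code. The five dimension names follow the benchmark's Likert dimensions. The record layout (`task`, `scores`, `error`), the `summarize` helper, and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Dimension names follow the benchmark's five Likert dimensions.
# The record layout and the numeric scores below are illustrative assumptions.
DIMENSIONS = (
    "information_coverage",
    "reasoning",
    "factual_consistency",
    "clarity",
    "depth",
)


def summarize(judgments):
    """Average each Likert dimension per task and count assigned error labels.

    Each judgment is a dict with keys:
      task   -- e.g. "DR-QA", "EC-QA", "LT-QA"
      scores -- dict mapping every name in DIMENSIONS to a numeric score
      error  -- an error-type label, or None if the answer was judged correct
    """
    per_task = defaultdict(lambda: defaultdict(list))
    error_counts = defaultdict(int)
    for j in judgments:
        for dim in DIMENSIONS:
            per_task[j["task"]][dim].append(j["scores"][dim])
        if j["error"] is not None:
            error_counts[j["error"]] += 1
    means = {
        task: {dim: mean(values) for dim, values in dims.items()}
        for task, dims in per_task.items()
    }
    return means, dict(error_counts)


# Hypothetical judgments for two EC-QA answers.
sample = [
    {"task": "EC-QA", "scores": dict.fromkeys(DIMENSIONS, 4), "error": None},
    {"task": "EC-QA", "scores": dict.fromkeys(DIMENSIONS, 2),
     "error": "Comparative Hallucination"},
]
task_means, errors = summarize(sample)
```

Reporting per-dimension means alongside error counts keeps the diagnosis separate from binary correctness, which is the point of the taxonomy.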

Key Findings

Accuracy falls sharply when moving from single-chunk to cross-year and cross-company tasks.

Numbers: Accuracy drop of 18.60% (to LT-QA) and 14.35% (to EC-QA)

Practical Use: Expect large accuracy drops when deploying LLMs on cross-document finance workflows; validate multi-document scenarios before production use.

Evidence Ref: Abstract; §4.3.1; Table 2

End-to-end RAG performance is much worse than gold-context performance; retrievers are the dominant bottleneck.

Numbers: End-to-end accuracy ≤ 27% vs gold-context up to 57.48% (DR-QA)

Practical Use: Focus engineering effort on retrieval and evidence routing (company/year) before optimizing model prompts or finetuning.

Evidence Ref: §4.3.4; Table 2 and Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 57.48% (best model under gold context on DR-QA) | - | - | DR-QA (gold context) | §4.3.4; Table 2 | Table 2
Accuracy | ≈43–44% (with gold context) | - | - | EC-QA and LT-QA (gold context) | §4.3.4; Table 2 | Table 2

What To Try In 7 Days

Run a smoke test: measure your retriever's Missing Evidence rate on representative cross-company and cross-year queries.

Bucket documents by company and fiscal year and add simple company/year routing before retrieval.

Combine a finance-specific embedding model with BM25 as separate signals, then check top-k overlap to detect hybrid noise.
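
The three steps above can be sketched in a few small helpers. This is a minimal illustration under stated assumptions, not the paper's retriever. The chunk fields (`id`, `company`, `year`), all function names, and the sample ids are hypothetical. The Missing Evidence rate is approximated here as the share of queries for which at least one gold chunk is absent from the retrieved list, and top-k overlap is measured with Jaccard similarity.

```python
from collections import defaultdict


def build_buckets(chunks):
    """Group chunk ids by (company, fiscal year)."""
    buckets = defaultdict(list)
    for chunk in chunks:
        buckets[(chunk["company"], chunk["year"])].append(chunk["id"])
    return buckets


def route(buckets, companies, years):
    """Restrict the candidate pool to the requested companies and years."""
    return [
        chunk_id
        for company in companies
        for year in years
        for chunk_id in buckets.get((company, year), [])
    ]


def missing_evidence_rate(retrieved_per_query, gold_per_query):
    """Share of queries where some gold chunk is missing from the results.

    Assumed definition for this sketch; adapt it to your own labelling.
    """
    missed = sum(
        1
        for retrieved, gold in zip(retrieved_per_query, gold_per_query)
        if not set(gold) <= set(retrieved)
    )
    return missed / len(gold_per_query)


def topk_overlap(dense_ids, bm25_ids, k):
    """Jaccard overlap between the top-k ids of two rankers."""
    a, b = set(dense_ids[:k]), set(bm25_ids[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical corpus: two companies, two fiscal years.
chunks = [
    {"id": "A-2022-1", "company": "A", "year": 2022},
    {"id": "A-2023-1", "company": "A", "year": 2023},
    {"id": "B-2023-1", "company": "B", "year": 2023},
]
buckets = build_buckets(chunks)
candidates = route(buckets, companies=["A", "B"], years=[2023])

gold = [["A-2023-1", "B-2023-1"], ["A-2023-1", "B-2023-1"]]
retrieved = [["A-2023-1"], ["A-2023-1", "B-2023-1"]]
rate = missing_evidence_rate(retrieved, gold)

overlap = topk_overlap(["x", "y", "z"], ["y", "z", "w"], k=3)
```

A very low overlap between the dense and BM25 lists suggests that fusing them may add distractors rather than evidence, so it is worth inspecting before merging the two signals.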

Risks & Boundaries

Limitations

Dataset samples 43 companies and 34 in Appendix: not exhaustive across market cap or all sectors.

Gold-context evaluations overstate real-world performance because retriever access is idealized.

When Not To Use

Do not use Fin-RATE to benchmark short conversational or synthetic financial QA that does not require cross-document integration.

Avoid using gold-context results to claim end-to-end system readiness without retriever evaluation.

Failure Modes

Missing Evidence: retriever fails to return crucial company/year chunks.

Comparative Hallucination: model fabricates cross-company comparisons when evidence is asymmetric.

Core Entities

Models

GPT-5-websearch, GPT-4.1, GPT-4.1-websearch, DeepSeek-V3, DeepSeek-V3.2, DeepSeek-R1, Qwen3-8B, Qwen3-14B, Qwen3-30B, Qwen3-235B, Llama-3.3-70B-Instruct, GPT-OSS-20B, MIMO-V2-Flash, Fin-R1, Fino1-14B, FinanceConnect-13B, TouchstoneGPT-7B-Instruct

Metrics

Accuracy, Recall@K, Precision@K, MRR, Likert scores (Information Coverage, Reasoning, Factual Consistency, Clarity, Depth), Retrieval error types (Missing Evidence, Sorting Failure, Distractor Evidence)
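
For reference, the three retrieval metrics can be computed as follows. This sketch uses the standard textbook definitions, which may differ in detail from the paper's implementation. The function names and the sample chunk ids are hypothetical, and ranked lists are assumed to contain unique ids.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant ids that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant (divides by k)."""
    relevant = set(relevant)
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def mean_reciprocal_rank(rankings, relevants):
    """Mean over queries of 1 / rank of the first relevant result.

    A query with no relevant result in its ranking contributes 0.
    """
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        relevant = set(relevant)
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical rankings for two queries.
ranked_q1 = ["c1", "c2", "c3", "c4"]
ranked_q2 = ["c4", "c1"]
gold_q1 = {"c2", "c4"}
gold_q2 = {"c4"}

r2 = recall_at_k(ranked_q1, gold_q1, k=2)
p2 = precision_at_k(ranked_q1, gold_q1, k=2)
mrr = mean_reciprocal_rank([ranked_q1, ranked_q2], [gold_q1, gold_q2])
```

Recall@K tracks Missing Evidence most directly, while MRR is sensitive to Sorting Failure because it only rewards the position of the first relevant chunk.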

Datasets

Fin-RATE (this paper)SEC EDGAR filings (source corpus)finance-embeddings-investopedia (used for dense retrieval)

Benchmarks

FinQADocFinQASEC-QAFinanceBenchPACIFIC