Fin-RATE: a realistic SEC-filings benchmark that stresses cross-document, cross-year and cross-company financial reasoning

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and novel for finance use cases; it reveals retrieval and entity/year alignment as the primary failure points and offers an effective hierarchical retrieval fix.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 35%

Novelty: 60%

Authors

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Fin-RATE shows that retrieval and entity/year alignment, not model size, are the main obstacles to reliable LLM support for multi-document financial analysis. Production systems should therefore prioritize evidence routing and retriever design.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

Fin-RATE is a new benchmark built from 2,472 SEC filings (15,311 text chunks) to test large language models on three analyst-style tasks: Detail & Reasoning (single-chunk), Enterprise Comparison (multi-company), and Longitudinal Tracking (multi-year). The authors evaluate 17 LLMs under gold-context and retrieval-augmented (RAG) setups. Key takeaways: models can reason when relevant evidence is provided, but end-to-end performance collapses under realistic retrieval because retrievers miss or mis-rank crucial company- and year-level evidence. A hierarchical, company/year-aware retriever closes much of that gap.

Problem Statement

Existing financial QA benchmarks treat filings as isolated facts and measure only answer correctness. Real analyst work requires integrating multiple documents, aligning entities and years, and explaining comparisons. Current benchmarks and evaluation protocols do not diagnose whether failures come from retrieval, temporal/entity misalignment, generation hallucination, or poor reasoning.

Main Contribution

Fin-RATE: a 7,500-question benchmark from 15,311 SEC filing chunks covering 43 companies (2020–2025) and three tasks: DR-QA, EC-QA, LT-QA.

A 13-type error taxonomy and Likert scoring on five fine-grained dimensions to diagnose failures beyond binary correctness.
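As an illustration of how such fine-grained judgements could be stored and aggregated, here is a minimal sketch. The five dimension names are taken from the metrics listed in this summary; the 1 to 5 scale, the `LikertJudgement` class, and the error-type label strings are assumptions made for the example, not the authors' schema.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names follow the summary; the snake_case keys are our own convention.
DIMENSIONS = (
    "information_coverage",
    "reasoning",
    "factual_consistency",
    "clarity",
    "depth",
)


@dataclass(frozen=True)
class LikertJudgement:
    """One graded answer (hypothetical record layout)."""
    question_id: str
    scores: dict              # dimension name -> integer score
    error_types: tuple = ()   # free-form labels, e.g. "missing_evidence"

    def __post_init__(self):
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")
        for dim, score in self.scores.items():
            # Assumed 1-5 Likert scale; adjust if the benchmark uses another range.
            if not 1 <= score <= 5:
                raise ValueError(f"{dim} score {score} is outside 1-5")


def dimension_means(judgements):
    """Average each dimension over a list of judgements."""
    return {dim: mean(j.scores[dim] for j in judgements) for dim in DIMENSIONS}


if __name__ == "__main__":
    a = LikertJudgement("q1", {d: 4 for d in DIMENSIONS})
    b = LikertJudgement("q2", {d: 2 for d in DIMENSIONS}, ("missing_evidence",))
    print(dimension_means([a, b]))
```

Keeping per-dimension scores next to error labels makes it possible to ask, for example, whether answers tagged with a retrieval error also score low on Information Coverage.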

Key Findings

Accuracy falls sharply when moving from single-chunk to cross-year and cross-company tasks.

Numbers: Accuracy drop: 18.60% (to LT-QA) and 14.35% (to EC-QA)

Practical Use: Expect large accuracy drops when deploying LLMs on cross-document finance workflows; validate multi-document scenarios before production use.

Evidence Ref: Abstract; §4.3.1; Table 2

End-to-end RAG performance is much worse than gold-context performance; retrievers are the dominant bottleneck.

Numbers: End-to-end accuracy ≤ 27% vs gold-context up to 57.48% (DR-QA)

Practical Use: Focus engineering effort on retrieval and evidence routing (company/year) before optimizing model prompts or finetuning.

Evidence Ref: §4.3.4; Table 2 and Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	57.48% (best model under gold context on DR-QA)	—	—	DR-QA (gold context)	§4.3.4; Table 2	Table 2
Accuracy	≈43–44% (with gold context)	—	—	EC-QA and LT-QA (gold context)	§4.3.4; Table 2	Table 2

What To Try In 7 Days

Run a smoke test: measure your retriever's Missing Evidence rate on representative cross-company and cross-year queries.

Bucket documents by company and fiscal year and add simple company/year routing before retrieval.

Combine a finance-specific embedding model with BM25 as separate signals, then check top-k overlap to detect hybrid noise.
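The three steps above can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the paper's retriever: the chunk records, company names, and function names are hypothetical; "Missing Evidence rate" is taken here to mean the share of queries with at least one gold chunk absent from the retrieved set; and Jaccard similarity is used as one possible top-k overlap measure.

```python
from collections import defaultdict

# Hypothetical chunk metadata: (chunk_id, company, fiscal_year).
CHUNKS = [
    ("c1", "ACME", 2022),
    ("c2", "ACME", 2023),
    ("c3", "GLOBEX", 2022),
    ("c4", "GLOBEX", 2023),
]


def build_buckets(chunks):
    """Group chunk ids by (company, fiscal_year) for routing before retrieval."""
    buckets = defaultdict(list)
    for chunk_id, company, year in chunks:
        buckets[(company, year)].append(chunk_id)
    return buckets


def route(buckets, companies, years):
    """Return only the chunk ids that match the requested companies and years."""
    candidates = []
    for company in companies:
        for year in years:
            candidates.extend(buckets.get((company, year), []))
    return candidates


def missing_evidence_rate(retrieved_per_query, gold_per_query):
    """Share of queries where at least one gold chunk was not retrieved
    (an assumed definition, chosen for the smoke test)."""
    if not gold_per_query:
        return 0.0
    misses = sum(
        1
        for qid, gold in gold_per_query.items()
        if not set(gold) <= set(retrieved_per_query.get(qid, []))
    )
    return misses / len(gold_per_query)


def topk_overlap(dense_ids, bm25_ids, k):
    """Jaccard overlap between the top-k lists of two retrievers."""
    a, b = set(dense_ids[:k]), set(bm25_ids[:k])
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    buckets = build_buckets(CHUNKS)
    print(route(buckets, ["ACME", "GLOBEX"], [2023]))
    gold = {"q1": ["c2", "c4"], "q2": ["c1", "c3"]}
    retrieved = {"q1": ["c2", "c4"], "q2": ["c1"]}
    print(missing_evidence_rate(retrieved, gold))
    print(topk_overlap(["c1", "c2", "c3"], ["c2", "c3", "c4"], k=3))
```

A very low overlap between the dense and BM25 lists suggests the two signals disagree on what is relevant, which is worth inspecting before merging them into a single ranking.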

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/datasets/GGLabYale/Fin-RATE

Data URLs

https://www.sec.gov/edgar/search-and-access

https://huggingface.co/datasets/GGLabYale/Fin-RATE

Risks & Boundaries

Limitations

The dataset samples 43 companies (34 in the Appendix), so it is not exhaustive across market capitalizations or sectors.

Gold-context evaluations overstate real-world performance because retriever access is idealized.

When Not To Use

Do not use Fin-RATE to benchmark short conversational or synthetic financial QA that does not require cross-document integration.

Avoid using gold-context results to claim end-to-end system readiness without retriever evaluation.

Failure Modes

Missing Evidence: retriever fails to return crucial company/year chunks.

Comparative Hallucination: model fabricates cross-company comparisons when evidence is asymmetric.

Core Entities

Models

GPT-5-websearch, GPT-4.1, GPT-4.1-websearch, DeepSeek-V3, DeepSeek-V3.2, DeepSeek-R1, Qwen3-8B, Qwen3-14B, Qwen3-30B, Qwen3-235B, Llama-3.3-70B-Instruct, GPT-OSS-20B, MIMO-V2-Flash, Fin-R1, Fino1-14B, FinanceConnect-13B, TouchstoneGPT-7B-Instruct

Metrics

Accuracy; Recall@K; Precision@K; MRR; Likert scores (Information Coverage, Reasoning, Factual Consistency, Clarity, Depth); Retrieval error types (Missing Evidence, Sorting Failure, Distractor Evidence)
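For readers who want to reproduce the retrieval metrics on their own corpus, the sketch below uses the standard textbook definitions of Recall@K, Precision@K, and MRR. The function names and the sample chunk ids are illustrative assumptions; the paper's exact evaluation code may differ in details such as tie handling.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold chunks that appear in the top-k results."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(ranked[:k])) / len(gold)


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k slots occupied by gold chunks."""
    if k <= 0:
        return 0.0
    gold = set(gold)
    hits = sum(1 for chunk_id in ranked[:k] if chunk_id in gold)
    return hits / k


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    gold = set(gold)
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, gold) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


if __name__ == "__main__":
    ranked = ["c3", "c1", "c2", "c9"]  # hypothetical retriever output
    gold = ["c1", "c2"]                # hypothetical gold evidence
    print(recall_at_k(ranked, gold, 2))
    print(precision_at_k(ranked, gold, 2))
    print(mean_reciprocal_rank([(ranked, gold), (["c1"], ["c1"])]))
```

Recall@K is the metric most directly tied to missing evidence, since a gold chunk outside the top-k can never reach the generator.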

Datasets

Fin-RATE (this paper), SEC EDGAR filings (source corpus), finance-embeddings-investopedia (used for dense retrieval)

Benchmarks

FinQA, DocFinQA, SEC-QA, FinanceBench, PACIFIC

