Overview
Production Readiness
0.3
Novelty Score
0.4
Cost Impact Score
0.25
Citation Count
11
Why It Matters For Business
Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.
Summary TLDR
FINANCEBENCH is a 10,231-item open-book benchmark of financial questions, answers, and evidence covering 40 U.S. public companies and 360 filings. The authors evaluate 16 model and retrieval configurations (built on GPT-4, GPT-4-Turbo, Claude2, and Llama2) on a 150-case human-eval sample, giving 2,400 manually labeled responses. Key findings: retrieval strategy and prompt order strongly affect accuracy; the best realistic setup (GPT-4-Turbo with long context) is 79% correct on the sample; naive closed-book use is unusable (GPT-4-Turbo closed book: 9% correct); hallucinations and incorrect numeric reasoning remain common. Use FINANCEBENCH to validate retrieval, prompt, and verification pipelines before deploying LLMs in finance.
Problem Statement
Finance teams need reliable, verifiable answers from LLMs on company filings. Existing QA datasets are not grounded in real analyst tasks or retrieval workflows. The field lacks an open-book benchmark that measures retrieval + reasoning on real financial documents.
Main Contribution
FINANCEBENCH dataset: 10,231 question-answer-evidence triplets across 40 companies and 360 filings (10-Ks, 10-Qs, 8-Ks, and earnings reports) covering 2015–2023.
Three question types and taxonomy: domain-relevant, novel-generated, and metrics-generated questions with labels for numerical/logical/extractive reasoning.
Human-eval sample: 150 diverse cases from the dataset; 16 model+retrieval configurations evaluated (2,400 responses manually labeled).
Empirical findings: retrieval method, long-context windows, and prompt order strongly affect accuracy; hallucinations and numeric errors remain frequent.
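The triplet structure and taxonomy described above can be pictured as a small record type. This is a minimal sketch, not the released schema: the class names, field names, and the `validate` helper are hypothetical, the example values are invented placeholders, and it assumes each item carries a single reasoning label.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Label vocabularies taken from the summary text; exact spellings are assumed.
QUESTION_TYPES = {"domain-relevant", "novel-generated", "metrics-generated"}
REASONING_TYPES = {"numerical", "logical", "extractive"}


@dataclass(frozen=True)
class Evidence:
    filing: str  # hypothetical identifier of the source filing
    page: int    # page where the supporting passage appears
    text: str    # the supporting passage itself


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    answer: str
    evidence: Tuple[Evidence, ...]
    question_type: str
    reasoning: str


def validate(item: BenchmarkItem) -> List[str]:
    """Return a list of problems; an empty list means the item looks well formed."""
    problems = []
    if not item.question.strip():
        problems.append("empty question")
    if not item.evidence:
        problems.append("no evidence attached")
    if item.question_type not in QUESTION_TYPES:
        problems.append(f"unknown question type: {item.question_type}")
    if item.reasoning not in REASONING_TYPES:
        problems.append(f"unknown reasoning label: {item.reasoning}")
    return problems


# Invented placeholder item, not taken from the dataset.
example = BenchmarkItem(
    question="What was ExampleCo's total revenue in its latest fiscal year?",
    answer="100 million",
    evidence=(Evidence(filing="EXAMPLECO_10K", page=42, text="Total revenue was 100 million."),),
    question_type="metrics-generated",
    reasoning="extractive",
)
print(validate(example))  # prints []
```

Keeping the evidence attached to every item is what makes an answer checkable against the filing rather than against the model's memory.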
Key Findings
FINANCEBENCH contains 10,231 curated QA-evidence triplets.
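Models perform poorly without retrieval: GPT-4-Turbo in the closed-book setting was only 9% correct on the human-eval sample.
Long context and accurate retrieval substantially raise accuracy: GPT-4-Turbo with long context reached 79% correct on the same sample.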
Retrieval architecture matters: per-document stores beat a single shared store.
Prompt order affects long-context performance strongly.
Hallucinations and incorrect reasoning are common and model-dependent.
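The prompt-order finding is easy to test in your own pipeline by building the same prompt both ways. The sketch below is illustrative only: `build_prompt`, the `order` values, and the template wording are assumptions, not the prompts used by the authors.

```python
def build_prompt(question: str, context: str, order: str = "context_first") -> str:
    """Assemble a prompt with the filing text before or after the question."""
    if order == "context_first":
        return f"Context:\n{context}\n\nQuestion: {question}"
    if order == "context_last":
        return f"Question: {question}\n\nContext:\n{context}"
    raise ValueError(f"unknown order: {order!r}")


# Placeholder inputs; in practice `context` would hold the filing text.
question = "What was total revenue for the fiscal year?"
context = "Total revenue for the fiscal year was 100."

for order in ("context_first", "context_last"):
    print(f"--- {order} ---")
    print(build_prompt(question, context, order))
```

Running the same evaluation cases through both orderings, with everything else held fixed, isolates the effect of prompt order from retrieval quality.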
Results
GPT-4-Turbo (Closed Book) correct rate
9%
GPT-4-Turbo (LongContext) correct rate
79%
GPT-4-Turbo (Oracle) correct rate
GPT-4-Turbo single vs shared vector store
Overall across evaluated configs
Who Should Care
What To Try In 7 Days
Run FINANCEBENCH's 150-case open-source sample against your model configuration to get a quick baseline.
Compare shared vs per-document vector stores and measure correct/incorrect trade-offs.
Test both Context-First and Context-Last prompts on long documents, and prefer Context-First when the filing text is placed directly in the prompt.
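To see why a shared index can return the wrong filing, the comparison above can be prototyped without any external services. The following is a toy sketch under loud assumptions: bag-of-words cosine similarity stands in for a real embedding model, `ToyStore` and the two builder functions are hypothetical, and the two filings are invented placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyStore:
    """In-memory stand-in for a vector store (word counts instead of embeddings)."""

    def __init__(self):
        self.chunks = []

    def add(self, doc_id, text):
        self.chunks.append((doc_id, text, Counter(tokenize(text))))

    def search(self, query, k=1):
        q = Counter(tokenize(query))
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[2]), reverse=True)
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


def build_shared(corpus):
    """One store holding chunks from every filing."""
    store = ToyStore()
    for doc_id, chunks in corpus.items():
        for chunk in chunks:
            store.add(doc_id, chunk)
    return store


def build_per_document(corpus):
    """One store per filing; the caller picks the store for the filing asked about."""
    stores = {}
    for doc_id, chunks in corpus.items():
        stores[doc_id] = ToyStore()
        for chunk in chunks:
            stores[doc_id].add(doc_id, chunk)
    return stores


# Invented placeholder filings with near-identical wording.
corpus = {
    "ALPHA_10K": ["Total revenue for the fiscal year was 100."],
    "BETA_10K": ["Total revenue for the fiscal year was 250 million dollars."],
}
query = "What was total revenue for the fiscal year?"  # the user is asking about BETA

shared_hit = build_shared(corpus).search(query)[0]
per_doc_hit = build_per_document(corpus)["BETA_10K"].search(query)[0]
print("shared store returned:", shared_hit[0])
print("per-document store returned:", per_doc_hit[0])
```

In this toy setup the shared store ranks the other company's passage first because filings use near-identical boilerplate, while the per-document store cannot make that mistake. A real comparison should use your production embedding model and chunking.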
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-turn questions only; no conversational multi-turn evaluation (Sec 6)
- Only public filings and public companies; excludes private documents and some analyst sources (Sec 6)
- Some gold answers can be ambiguous depending on analyst assumptions (Sec 6)
- Long-context prompts were truncated for very long filings, which can hide retrieval failure modes (Sec 4)
When Not To Use
- When your use case requires multi-turn interactive analysis
- When you must handle private or proprietary documents not present in FINANCEBENCH
- For direct cross-company comparative questions across two full filings
Failure Modes
- Hallucinations: plausible but evidence-contradicting answers
- Incorrect numeric calculations or wrong units
- Refusals where a model could have answered with retrieval tuning
- Failure to retrieve the correct passage when using shared indexes
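Numeric and unit errors in particular lend themselves to an automated check. The sketch below is one possible verification step, not a method from the paper: `parse_amount`, `numbers_agree`, the scale words, and the 1% tolerance are all assumptions, and the parser naively takes the first number in the text, so an answer that leads with a year would be misread.

```python
import math
import re

# Scale words recognised by this sketch; extend as needed.
SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

_AMOUNT = re.compile(r"(-?\d[\d,]*\.?\d*)\s*(thousand|million|billion)?")


def parse_amount(text):
    """Return the first number in `text`, scaled by a trailing unit word, or None."""
    match = _AMOUNT.search(text.lower())
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * SCALE.get(match.group(2), 1.0)


def numbers_agree(model_answer, gold_answer, rel_tol=0.01):
    """True when both answers parse to amounts within the relative tolerance."""
    a = parse_amount(model_answer)
    b = parse_amount(gold_answer)
    if a is None or b is None:
        return False
    return math.isclose(a, b, rel_tol=rel_tol)


print(numbers_agree("$1,500 million", "1.5 billion"))  # same amount, different units
print(numbers_agree("1,500", "1.5 billion"))           # unit word dropped
```

A check like this flags candidate mismatches for review; it does not replace the manual labeling used in the human evaluation.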
Core Entities
Models
- GPT-4
- GPT-4-Turbo
- Claude2
- Llama2
Metrics
- percent_correct
- percent_incorrect
- percent_failed
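The three metrics are shares of labeled responses, so they sum to 100 for any configuration; with 150 cases and 16 configurations there are 150 × 16 = 2,400 labels in total. A minimal sketch of the calculation follows, assuming each response carries exactly one of the labels "correct", "incorrect", or "failed" (the label strings and the `score` function are hypothetical).

```python
from collections import Counter

LABELS = ("correct", "incorrect", "failed")


def score(labels):
    """Convert a list of per-response labels into the three percentage metrics."""
    if not labels:
        raise ValueError("no labels to score")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    return {f"percent_{name}": 100.0 * counts[name] / total for name in LABELS}


# Invented labels for one configuration, for illustration only.
print(score(["correct", "correct", "incorrect", "failed"]))
# prints {'percent_correct': 50.0, 'percent_incorrect': 25.0, 'percent_failed': 25.0}
```

Reporting all three numbers together matters because a configuration can lower its incorrect rate simply by declining to answer more often.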
Datasets
- FINANCEBENCH
Benchmarks
- FinQA
- ConvFinQA
- TAT-QA

