Overview
The benchmark and manual labels are solid for open-book financial QA; the models tested are not yet reliable enough for high-stakes production use without verification.
Citations: 11
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 1/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 25%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.
Who Should Care
Summary TLDR
FINANCEBENCH is a 10,231-item open-book benchmark of financial questions, answers, and evidence covering 40 U.S. public companies and 360 filings. The authors evaluate 16 model+retrieval setups (GPT-4, GPT-4-Turbo, Claude2, Llama2) on a 150-case human-eval sample (2,400 labelled responses). Key findings: retrieval strategy and prompt order strongly affect accuracy; the best realistic setup (GPT-4-Turbo with long context) is 79% correct on the sample; naive closed-book use is unusable (GPT-4-Turbo closed-book: 9% correct); hallucinations and incorrect numeric reasoning remain common. Use FINANCEBENCH to validate retrieval, prompt, and verification pipelines before deploying LLMs in finance.
Problem Statement
Finance teams need reliable, verifiable answers from LLMs on company filings. Existing QA datasets are not grounded in real analyst tasks or retrieval workflows. The field lacks an open-book benchmark that measures retrieval + reasoning on real financial documents.
Main Contribution
FINANCEBENCH dataset: 10,231 question-answer-evidence triplets across 40 companies and 360 filings (10-Ks, 10-Qs, 8-Ks, earnings reports) covering 2015–2023.
Three question types and taxonomy: domain-relevant, novel-generated, and metrics-generated questions with labels for numerical/logical/extractive reasoning.
Key Findings
FINANCEBENCH contains 10,231 curated QA-evidence triplets.
Models perform poorly without retrieval: GPT-4-Turbo in the closed-book setting answered only 14 of the 150 sample questions correctly (9%).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4-Turbo (Closed Book) correct rate | 9% | — | — | Human eval sample (n=150) | Table 2: GPT-4-Turbo Closed Book 14/150 correct (9%) | Table 2 |
| GPT-4-Turbo (LongContext) correct rate | 79% | — | — | Human eval sample (n=150) | Table 2: GPT-4-Turbo Long Context 118/150 correct (79%) | Table 2 |
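
The percentages in the table follow directly from the raw counts in Table 2. A minimal sketch of that arithmetic, assuming Python. The Wilson score interval is an addition here (it is not reported in the source) to illustrate how much uncertainty a 150-case sample carries, and the function names are hypothetical.

```python
import math

def correct_rate(correct: int, total: int) -> float:
    """Fraction of responses labelled correct."""
    return correct / total

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion.

    This interval is an illustrative addition, not a figure from the source.
    """
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Counts as reported in the Results table (Table 2, n=150).
results = {
    "GPT-4-Turbo (Closed Book)": (14, 150),
    "GPT-4-Turbo (Long Context)": (118, 150),
}

for name, (correct, total) in results.items():
    low, high = wilson_interval(correct, total)
    rate = correct_rate(correct, total)
    print(f"{name}: {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
```

With only 150 cases per configuration, small gaps between setups may fall inside the interval, which is worth keeping in mind when comparing your own configurations.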
What To Try In 7 Days
Run FINANCEBENCH's 150-case open-source sample against your model configuration to get a quick baseline.
Compare shared vs per-document vector stores and measure correct/incorrect trade-offs.
Test both Context-First and Context-Last prompts on long documents and prefer Context-First for filings-in-context setups.
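
The Context-First versus Context-Last comparison comes down to where the filing text sits relative to the question. A minimal sketch, assuming Python; the template wording, the example inputs, and the `build_prompt` helper are hypothetical illustrations, not the prompts used by the authors.

```python
def build_prompt(question: str, context: str, order: str = "context_first") -> str:
    """Assemble a single-turn prompt with the filing text before or after the question.

    The section labels are illustrative assumptions, not the benchmark's templates.
    """
    context_block = f"Filing excerpt:\n{context.strip()}"
    question_block = f"Question: {question.strip()}"
    if order == "context_first":
        parts = [context_block, question_block]
    elif order == "context_last":
        parts = [question_block, context_block]
    else:
        raise ValueError(f"unknown order: {order!r}")
    return "\n\n".join(parts)

# Hypothetical example inputs, not taken from the dataset.
question = "What was total revenue in the fiscal year?"
context = "Total revenue was $1,000 million."

for order in ("context_first", "context_last"):
    print(f"--- {order} ---")
    print(build_prompt(question, context, order))
```

Running the same questions through both orderings keeps everything else fixed, so a difference in correct rate can be attributed to prompt order rather than to retrieval or model choice.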
Risks & Boundaries
Limitations
Single-turn questions only; no conversational multi-turn evaluation (Sec 6)
Only public filings and public companies; excludes private documents and some analyst sources (Sec 6)
When Not To Use
When your use case requires multi-turn interactive analysis
When you must handle private or proprietary documents not present in FINANCEBENCH
Failure Modes
Hallucinations: plausible but evidence-contradicting answers
Incorrect numeric calculations or wrong units
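
One way to catch the numeric failure mode is to compare the number in a model answer against the labelled answer within a tolerance. A rough sketch, assuming Python; the parsing rules, the 1% tolerance, and the helper names are assumptions for illustration and are not part of FINANCEBENCH.

```python
import math
import re
from typing import Optional

# Scale words that commonly follow figures in filings (illustrative, not exhaustive).
_SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

_AMOUNT = re.compile(
    r"(-?\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?",
    re.IGNORECASE,
)

def parse_amount(text: str) -> Optional[float]:
    """Extract the first number in text and apply a trailing scale word, if any.

    Limitation: the first number wins, so a leading year such as "In 2022, ..."
    would be picked up instead of the figure that follows it.
    """
    match = _AMOUNT.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    scale = match.group(2)
    return value * _SCALES[scale.lower()] if scale else value

def numerically_matches(answer: str, gold: str, rel_tol: float = 0.01) -> bool:
    """True when both strings contain a number and the two agree within rel_tol."""
    a, g = parse_amount(answer), parse_amount(gold)
    if a is None or g is None:
        return False
    return math.isclose(a, g, rel_tol=rel_tol)

# Same amount in different units passes; a dropped unit is flagged.
print(numerically_matches("$1.577 billion", "$1,577 million"))
print(numerically_matches("$1,577", "$1,577 million"))
```

A check like this only flags candidate errors for review. It does not replace human labelling, and it says nothing about hallucinated answers that happen to contain no number.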

