FINANCEBENCH: 10,231 open-book financial QA cases to stress-test LLMs

November 20, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark and manual labels are solid for open-book financial QA; models tested are not yet reliable enough for high-stakes production without verification.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 30%

Novelty: 40%

Authors

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen

Links

Abstract / PDF / Data

Why It Matters For Business

Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.

Who Should Care

Summary TLDR

FINANCEBENCH is a 10,231-item open-book benchmark of financial questions, answers, and evidence covering 40 U.S. public companies and 360 filings. The authors evaluate 16 model and retrieval configurations (GPT-4, GPT-4-Turbo, Claude2, Llama2) on a 150-case human-evaluation sample (2,400 labelled responses). Key findings: retrieval strategy and prompt order strongly affect accuracy; the best realistic setup (GPT-4-Turbo long-context) is 79% correct on the sample; closed-book use is not viable (GPT-4-Turbo closed-book: 9% correct); hallucinations and incorrect numeric reasoning remain common. Use FINANCEBENCH to validate retrieval, prompt, and verification pipelines before deploying LLMs in finance.

Problem Statement

Finance teams need reliable, verifiable answers from LLMs on company filings. Existing QA datasets are not grounded in real analyst tasks or retrieval workflows. The field lacks an open-book benchmark that measures retrieval + reasoning on real financial documents.

Main Contribution

FINANCEBENCH dataset: 10,231 question-answer-evidence triplets across 40 companies and 360 filings (10-Ks, 10-Qs, 8-Ks, and earnings reports) covering 2015–2023.

Three question types and taxonomy: domain-relevant, novel-generated, and metrics-generated questions with labels for numerical/logical/extractive reasoning.
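One way to picture a single benchmark case is as a typed record. The sketch below is illustrative only: the class and field names are hypothetical and are not the dataset's actual schema, the example values are placeholders, and it assumes one reasoning label per case, which the summary does not confirm.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # The three question types named in the summary.
    DOMAIN_RELEVANT = "domain-relevant"
    NOVEL_GENERATED = "novel-generated"
    METRICS_GENERATED = "metrics-generated"


class ReasoningType(Enum):
    # The three reasoning labels named in the summary.
    NUMERICAL = "numerical"
    LOGICAL = "logical"
    EXTRACTIVE = "extractive"


@dataclass(frozen=True)
class BenchmarkCase:
    """Hypothetical container for one question-answer-evidence triplet."""
    company: str
    filing: str          # e.g. a 10-K or 10-Q identifier
    question: str
    answer: str
    evidence: str        # passage from the filing that supports the answer
    question_type: QuestionType
    reasoning: ReasoningType


# Placeholder values only; this is not a real FINANCEBENCH case.
example = BenchmarkCase(
    company="ExampleCo",
    filing="10-K (placeholder)",
    question="What was ExampleCo's total revenue for the fiscal year?",
    answer="(placeholder answer)",
    evidence="(placeholder evidence passage)",
    question_type=QuestionType.METRICS_GENERATED,
    reasoning=ReasoningType.EXTRACTIVE,
)
```

Keeping the evidence passage alongside the answer is what makes a case usable for open-book testing: a pipeline can be checked both on the answer and on whether it retrieved the supporting text.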

Key Findings

FINANCEBENCH contains 10,231 curated QA-evidence triplets.

Numbers: 10,231 cases; 360 documents; 40 companies

Practical Use: You can run targeted open-book tests on realistic financial questions using this dataset.

Evidence Ref: Sec 3

Models perform poorly without retrieval: GPT-4-Turbo answered only 9% of sampled questions correctly in the closed-book setting.

Numbers: GPT-4-Turbo Closed Book: 9% correct (n=150)

Practical Use: Do not deploy closed-book LLMs for production financial QA; add retrieval or documents in-context.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4-Turbo (Closed Book) correct rate | 9% | | | Human eval sample (n=150) | Table 2: GPT-4-Turbo Closed Book 14/150 correct (9%) | Table 2 |
| GPT-4-Turbo (Long Context) correct rate | 79% | | | Human eval sample (n=150) | Table 2: GPT-4-Turbo Long Context 118/150 correct (79%) | Table 2 |
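The percentages follow directly from the counts quoted from Table 2. The short sketch below reproduces that arithmetic; treating the closed-book run as the baseline for a delta is an assumption made here for illustration, since the results list no baseline or delta for these rows.

```python
def correct_rate(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts as quoted from Table 2 (human eval sample, n=150).
closed_book = correct_rate(14, 150)
long_context = correct_rate(118, 150)

# Assumption: closed-book is used as the baseline for the gap.
delta_points = long_context - closed_book

# 16 configurations, each labelled on the same 150 cases.
labelled_responses = 16 * 150

print(f"closed book:  {closed_book:.0f}%")
print(f"long context: {long_context:.0f}%")
print(f"gap:          {delta_points:.1f} percentage points")
print(f"labelled responses: {labelled_responses}")
```

With n=150 each case moves the rate by about 0.67 percentage points, so small differences between configurations on this sample should be read cautiously.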

What To Try In 7 Days

Run FINANCEBENCH's 150-case open-source sample against your model configuration to get a quick baseline.

Compare shared vs per-document vector stores and measure correct/incorrect trade-offs.

Test both Context-First and Context-Last prompts on long documents and prefer Context-First for filings-in-context setups.
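For the prompt-order test, a minimal sketch of the two layouts might look like the following. The function name `build_prompt` and the template wording are hypothetical and are not the paper's exact prompts; the only point illustrated is where the filing text sits relative to the question.

```python
def build_prompt(question: str, context: str, context_first: bool = True) -> str:
    """Assemble a prompt with the document context before or after the question.

    The "Context:" / "Question:" labels are illustrative assumptions,
    not the templates used by the FINANCEBENCH authors.
    """
    context_block = f"Context:\n{context.strip()}"
    question_block = f"Question: {question.strip()}"
    if context_first:
        parts = [context_block, question_block]
    else:
        parts = [question_block, context_block]
    return "\n\n".join(parts)


# Placeholder inputs; substitute a real filing excerpt and question.
filing_text = "(filing text or retrieved passages go here)"
question = "What was the company's total revenue for the fiscal year?"

context_first_prompt = build_prompt(question, filing_text, context_first=True)
context_last_prompt = build_prompt(question, filing_text, context_first=False)
```

Running the same cases through both variants and comparing labelled outcomes keeps everything else fixed, so any difference can be attributed to prompt order.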

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single-turn questions only; no conversational multi-turn evaluation (Sec 6)

Only public filings and public companies; excludes private documents and some analyst sources (Sec 6)

When Not To Use

When your use case requires multi-turn interactive analysis

When you must handle private or proprietary documents not present in FINANCEBENCH

Failure Modes

Hallucinations: plausible but evidence-contradicting answers

Incorrect numeric calculations or wrong units

Core Entities

Models

GPT-4, GPT-4-Turbo, Claude2, Llama2

Metrics

percent_correct, percent_incorrect, percent_failed
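These three metrics can be computed from per-response labels with a simple tally. The sketch below assumes each response receives exactly one of the labels "correct", "incorrect", or "failed", with "failed" meaning the model did not produce an answer; the label strings and the `summarize` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("correct", "incorrect", "failed")


def summarize(labels: Sequence[str]) -> Dict[str, float]:
    """Turn per-response labels into percent_correct/incorrect/failed."""
    if not labels:
        raise ValueError("labels must not be empty")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {f"percent_{name}": 100.0 * counts[name] / total for name in LABELS}


# Toy labels for illustration only; not results from the paper.
toy_labels = ["correct", "correct", "correct", "incorrect", "failed"]
toy_summary = summarize(toy_labels)
```

Reporting all three together matters because a configuration can lower its incorrect rate simply by declining to answer more often, which shows up as a higher failed rate.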

Datasets

FINANCEBENCH

Benchmarks

FinQA, ConvFinQA, TAT-QA