FINANCEBENCH: 10,231 open-book financial QA cases to stress-test LLMs

Overview

Decision Snapshot: Needs Validation

The benchmark and manual labels are solid for open-book financial QA; models tested are not yet reliable enough for high-stakes production without verification.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 30%

Novelty: 40%

Authors

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen

Links

Abstract / PDF / Data

Why It Matters For Business

Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

FINANCEBENCH is a 10,231-item open-book benchmark of financial questions, answers, and evidence covering 40 U.S. public companies and 360 filings. The authors evaluate 16 model and retrieval configurations (GPT-4, GPT-4-Turbo, Claude2, Llama2) on a 150-case human-evaluation sample, yielding 2,400 labelled responses. Key findings: retrieval strategy and prompt order both change accuracy substantially; the best realistic setup (GPT-4-Turbo with long context) is 79% correct on the sample; naive closed-book use is unusable (GPT-4-Turbo closed book: 9% correct); hallucinations and incorrect numeric reasoning remain common. Use FINANCEBENCH to validate retrieval, prompt, and verification pipelines before deploying LLMs in finance.

Problem Statement

Finance teams need reliable, verifiable answers from LLMs on company filings. Existing QA datasets are not grounded in real analyst tasks or retrieval workflows. The field lacks an open-book benchmark that measures retrieval + reasoning on real financial documents.

Main Contribution

FINANCEBENCH dataset: 10,231 question-answer-evidence triplets across 40 companies and 360 filings (10-Ks, 10-Qs, 8-Ks, and earnings reports) covering 2015–2023.

Three question types and taxonomy: domain-relevant, novel-generated, and metrics-generated questions with labels for numerical/logical/extractive reasoning.

Key Findings

FINANCEBENCH contains 10,231 curated QA-evidence triplets.

Numbers: 10,231 cases; 360 documents; 40 companies

Practical Use: You can run targeted open-book tests on realistic financial questions using this dataset.

Evidence Ref: Sec 3

Models perform poorly without retrieval: in the closed-book setting, GPT-4-Turbo answered only 9% of the sampled questions correctly.

Numbers: GPT-4-Turbo Closed Book: 9% correct (n=150)

Practical Use: Do not deploy closed-book LLMs for production financial QA; add retrieval or documents in-context.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4-Turbo (Closed Book) correct rate	9%	—	—	Human eval sample (n=150)	Table 2: GPT-4-Turbo Closed Book 14/150 correct (9%)	Table 2
GPT-4-Turbo (LongContext) correct rate	79%	—	—	Human eval sample (n=150)	Table 2: GPT-4-Turbo Long Context 118/150 correct (79%)	Table 2
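
The percentages in the table follow directly from the counts quoted in the Evidence column. The short Python sketch below recomputes them and the gap between the two configurations; the variable and function names are illustrative choices, not taken from the paper.

```python
# Counts as quoted in the Table 2 evidence column (n = 150 per configuration).
SAMPLE_SIZE = 150
correct_counts = {
    "GPT-4-Turbo (Closed Book)": 14,
    "GPT-4-Turbo (Long Context)": 118,
}


def percent_correct(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    return 100.0 * correct / total


rates = {
    name: percent_correct(count, SAMPLE_SIZE)
    for name, count in correct_counts.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% (reported as {round(rate)}%)")

# Difference between the two configurations, in percentage points.
gap = rates["GPT-4-Turbo (Long Context)"] - rates["GPT-4-Turbo (Closed Book)"]
print(f"Long Context minus Closed Book: {gap:.1f} percentage points")
```

With only 150 cases per configuration, each percentage point corresponds to one or two questions, so small differences between setups should be read with caution.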

What To Try In 7 Days

Run FINANCEBENCH's 150-case open-source sample against your model configuration to get a quick baseline.

Compare shared vs per-document vector stores and measure correct/incorrect trade-offs.

Test both Context-First and Context-Last prompts on long documents and prefer Context-First for filings-in-context setups.
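
The prompt-order test in the last step can be sketched as follows. This is a minimal illustration that assumes Context-First means the filing text precedes the question and Context-Last means it follows; `build_prompt`, the section labels, and the placeholder strings are hypothetical and are not the paper's templates.

```python
def build_prompt(question: str, context: str, context_first: bool = True) -> str:
    """Assemble a single-turn prompt with the filing text before or after the question."""
    context_block = f"Context:\n{context.strip()}"
    question_block = f"Question: {question.strip()}"
    if context_first:
        parts = [context_block, question_block]
    else:
        parts = [question_block, context_block]
    return "\n\n".join(parts)


# Illustrative placeholders, not items from the dataset.
question = "What was the company's total revenue for the fiscal year?"
context = "<filing excerpt or full filing text goes here>"

context_first_prompt = build_prompt(question, context, context_first=True)
context_last_prompt = build_prompt(question, context, context_first=False)

print(context_first_prompt)
```

Sending both variants to the same model over the same cases isolates the effect of ordering from the effect of retrieval.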

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/PatronusAI/financebench

https://arxiv.org/abs/2311.11944

Risks & Boundaries

Limitations

Single-turn questions only; no conversational multi-turn evaluation (Sec 6)

Only public filings and public companies; excludes private documents and some analyst sources (Sec 6)

When Not To Use

When your use case requires multi-turn interactive analysis

When you must handle private or proprietary documents not present in FINANCEBENCH

Failure Modes

Hallucinations: plausible but evidence-contradicting answers

Incorrect numeric calculations or wrong units

Core Entities

Models

GPT-4, GPT-4-Turbo, Claude2, Llama2

Metrics

percent_correct, percent_incorrect, percent_failed
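
As a rough sketch, the three metrics can be tallied from per-response labels as shown below. The label strings (`correct`, `incorrect`, `failed`), the function name `score_labels`, and the example data are assumptions made for illustration; they are not the paper's annotation scheme or results.

```python
from collections import Counter

# Assumed label vocabulary; adapt to whatever your annotators actually record.
LABELS = ("correct", "incorrect", "failed")


def score_labels(labels):
    """Convert a sequence of per-response labels into the three percentage metrics."""
    labels = list(labels)
    if not labels:
        raise ValueError("at least one labelled response is required")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {f"percent_{label}": 100.0 * counts[label] / total for label in LABELS}


# Made-up labels for ten responses, purely for illustration.
example = ["correct"] * 6 + ["incorrect"] * 3 + ["failed"] * 1
scores = score_labels(example)
print(scores)
```

Keeping incorrect answers separate from failures to answer matters in practice, because a refusal is usually less harmful than a confident wrong figure.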

Datasets

FINANCEBENCH

Benchmarks

FinQA, ConvFinQA, TAT-QA
