FINANCEBENCH: 10,231 open-book financial QA cases to stress-test LLMs

November 20, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.25

Citation Count

11

Authors

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen

Why It Matters For Business

Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.

Summary TLDR

FINANCEBENCH is a 10,231-item open-book benchmark of financial questions, answers, and evidence covering 40 U.S. public companies and 360 filings. The authors evaluate 16 model+retrieval setups (GPT-4, GPT-4-Turbo, Claude2, Llama2) on a 150-case human-eval sample (2,400 labeled responses). Key findings: retrieval strategy and prompt order strongly affect accuracy; the best realistic setup (GPT-4-Turbo with long context) is 79% correct on the sample; naive closed-book use is unusable (GPT-4-Turbo closed-book: 9% correct); and hallucinations and incorrect numeric reasoning remain common. Use FINANCEBENCH to validate retrieval, prompt, and verification pipelines before deploying LLMs in finance.

Problem Statement

Finance teams need reliable, verifiable answers from LLMs on company filings. Existing QA datasets are not grounded in real analyst tasks or retrieval workflows. The field lacks an open-book benchmark that measures retrieval + reasoning on real financial documents.

Main Contribution

FINANCEBENCH dataset: 10,231 question-answer-evidence triplets across 40 companies and 360 filings (10-Ks, 10-Qs, 8-Ks, and earnings reports) covering 2015–2023.

Three question types and taxonomy: domain-relevant, novel-generated, and metrics-generated questions with labels for numerical/logical/extractive reasoning.

Human-eval sample: 150 diverse cases from the dataset; 16 model+retrieval configurations evaluated (2,400 responses manually labeled).

Empirical findings: retrieval method, long-context windows, and prompt order strongly affect accuracy; hallucinations and numeric errors remain frequent.

Key Findings

FINANCEBENCH contains 10,231 curated QA-evidence triplets.

Numbers: 10,231 cases; 360 documents; 40 companies

Models perform poorly without retrieval: GPT-4-Turbo closed-book correct rate was very low.

Numbers: GPT-4-Turbo Closed Book: 9% correct (n=150)

Long context and accurate retrieval substantially raise accuracy.

Numbers: GPT-4-Turbo LongContext: 79% correct; Oracle: 85% correct

Retrieval architecture matters: per-document stores beat a single shared store.

Numbers: GPT-4-Turbo: single (per-document) store 50% vs shared store 19% correct

Prompt order affects long-context performance strongly.

Numbers: LongContext, Context-First vs Context-Last: GPT-4-Turbo 78% vs 25%; Claude2 76% vs 37%

Hallucinations and incorrect reasoning are common and model-dependent.

Numbers: Overall evaluated responses: 47% correct, 26% incorrect, 27% failures (n=1200)

Results

GPT-4-Turbo (Closed Book) correct rate

Value: 9%

GPT-4-Turbo (LongContext) correct rate

Value: 79%

GPT-4-Turbo (Oracle) correct rate

Value: 85%

GPT-4-Turbo single vs shared vector store

Value: 50% vs 19% correct

Overall across evaluated configs

Value: 47% correct, 26% incorrect, 27% failed

What To Try In 7 Days

Run FINANCEBENCH's 150-case open-source sample against your model configuration to get a quick baseline.

Compare shared vs per-document vector stores and measure correct/incorrect trade-offs.

Test both Context-First and Context-Last prompts on long documents and prefer Context-First for filings-in-context setups.
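A minimal sketch of the prompt-order comparison, assuming a simple two-part prompt. The function name `build_prompt`, the `Context:` and `Question:` labels, and the sample strings are hypothetical; the paper's exact templates are not reproduced here.

```python
def build_prompt(question: str, context: str, order: str = "context_first") -> str:
    """Assemble a prompt with the filing text before or after the question.

    The section labels are illustrative assumptions, not the authors' templates.
    """
    context_block = f"Context:\n{context}"
    question_block = f"Question: {question}"
    if order == "context_first":
        parts = [context_block, question_block]
    elif order == "context_last":
        parts = [question_block, context_block]
    else:
        raise ValueError(f"unknown order: {order!r}")
    return "\n\n".join(parts)


# Hypothetical inputs for illustration only.
question = "What was total revenue in fiscal 2022?"
context = "Total revenue was 120 million in fiscal 2022."

first = build_prompt(question, context, order="context_first")
last = build_prompt(question, context, order="context_last")
```

Sending both variants to the same model on the same cases isolates prompt order as the only variable, which is the comparison the item above suggests.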

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-turn questions only; no conversational multi-turn evaluation (Sec 6)
  • Only public filings and public companies; excludes private documents and some analyst sources (Sec 6)
  • Some gold answers can be ambiguous depending on analyst assumptions (Sec 6)
  • Long-context prompts were truncated for very long filings, which can hide retrieval failure modes (Sec 4)

When Not To Use

  • When your use case requires multi-turn interactive analysis
  • When you must handle private or proprietary documents not present in FINANCEBENCH
  • For direct cross-company comparative questions across two full filings

Failure Modes

  • Hallucinations: plausible but evidence-contradicting answers
  • Incorrect numeric calculations or wrong units
  • Refusals where a model could have answered with retrieval tuning
  • Failure to retrieve the correct passage when using shared indexes
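The shared-index failure mode can be illustrated with a toy retriever. This is a sketch under stated assumptions, not the authors' pipeline: the bag-of-words "embedding", the document IDs, and the chunk texts are all made up for illustration, and a real setup would use a learned embedding model and a vector database.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical chunks as (doc_id, text) pairs.
CHUNKS = [
    ("ACME_2022_10K", "Total revenue for the company was 120 million in fiscal 2022, up from 100 million."),
    ("ACME_2022_10K", "Capital expenditures were 15 million in fiscal 2022."),
    ("GLOBEX_2022_10K", "Total revenue in fiscal 2022 was 480 million."),
]


def retrieve(question, chunks, k=1, doc_id=None):
    """Rank chunks by similarity; doc_id=None searches the shared pool."""
    pool = [c for c in chunks if doc_id is None or c[0] == doc_id]
    q = embed(question)
    return sorted(pool, key=lambda c: cosine(q, embed(c[1])), reverse=True)[:k]


question = "What was total revenue in fiscal 2022?"
shared_hit = retrieve(question, CHUNKS)[0]
per_doc_hit = retrieve(question, CHUNKS, doc_id="ACME_2022_10K")[0]
```

In this toy data the shared search returns the other company's revenue passage because it matches the question wording more closely, while restricting the pool to the target filing returns the intended passage.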

Core Entities

Models

  • GPT-4
  • GPT-4-Turbo
  • Claude2
  • Llama2

Metrics

  • percent_correct
  • percent_incorrect
  • percent_failed
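As a rough sketch, the three metrics can be tallied from per-response labels as below. The label strings, the `summarize` helper, and the example data are assumptions for illustration; in the paper the labels come from manual review, which this code does not replace.

```python
from collections import Counter

# Assumed label vocabulary, mirroring the three metrics listed above.
LABELS = ("correct", "incorrect", "failed")


def summarize(labels):
    """Return percent_correct, percent_incorrect, and percent_failed for a list of labels."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {f"percent_{name}": 100.0 * counts[name] / total for name in LABELS}


# Made-up labels for ten responses, not results from the paper.
example = ["correct"] * 6 + ["incorrect"] * 3 + ["failed"] * 1
rates = summarize(example)
```

Reporting all three rates together keeps wrong answers separate from cases where the model did not produce a usable answer, since the two call for different fixes.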

Datasets

  • FINANCEBENCH

Benchmarks

  • FinQA
  • ConvFinQA
  • TAT-QA