FinLLMs: Use formulas + LLMs to auto-generate QA datasets for financial numerical reasoning

January 19, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Ziqiang Yuan, Kaiyuan Wang, Shoutai Zhu, Ye Yuan, Jingya Zhou, Yanlin Zhu, Wenqi Wei

Links

Abstract / PDF

Why It Matters For Business

You can scale financial QA training data cheaply by programmatically generating tables, text, and formula-backed answers; this lowers reliance on costly expert annotation while improving model accuracy on financial numerical tasks.

Summary TLDR

FinLLMs builds large-scale financial QA data automatically. The authors collect 21 core accounting formulas, build a variable graph, expand formulas by graph traversal, and use GPT-3.5 to generate paired tables, text, and executable DSL programs (small domain-specific math programs). Training standard retriever-generator models (FinQANet, DyRRen) on the 15k version of the FinLLMs dataset improves execution accuracy (EA) and program accuracy (PA) by roughly 2 percentage points or more compared with training on FinQA/TAT-QA. The method reduces manual labeling needs but has limits: simpler DSL programs (≤4 steps), distribution mismatches with human data, occasional unit errors, and privacy risks from LLM-based synthesis.
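The formula-expansion step can be pictured with a minimal Python sketch. It assumes formulas are stored as a target variable plus a right-hand-side expression, and that one traversal step means substituting a variable with the seed formula that defines it. The `Formula` class, the `SEED` list, and the `expand` function are hypothetical names for illustration; the paper's actual graph construction and traversal may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Formula:
    target: str  # variable the formula defines
    expr: str    # right-hand side written over other variable names


# Hypothetical seed formulas for illustration; not the paper's list of 21.
SEED = [
    Formula("gross_profit", "revenue - cost_of_goods_sold"),
    Formula("gross_margin", "gross_profit / revenue"),
    Formula("operating_income", "gross_profit - operating_expenses"),
]


def variables(expr):
    """Return the variable names used in an expression, in sorted order."""
    return sorted(set(re.findall(r"[a-z_]+", expr)))


def expand(seed, max_steps):
    """Derive new formulas by substituting variables with their definitions.

    Each step follows one edge of the variable graph: from a variable used
    in a formula to the seed formula that defines that variable.
    """
    by_target = {f.target: f for f in seed}
    seen = {(f.target, f.expr) for f in seed}
    result = list(seed)
    frontier = list(seed)
    for _ in range(max_steps):
        next_frontier = []
        for formula in frontier:
            for var in variables(formula.expr):
                definition = by_target.get(var)
                if definition is None:
                    continue  # leaf variable: nothing to substitute
                new_expr = re.sub(
                    rf"\b{var}\b",
                    lambda _m, d=definition: f"({d.expr})",
                    formula.expr,
                )
                if (formula.target, new_expr) not in seen:
                    seen.add((formula.target, new_expr))
                    derived = Formula(formula.target, new_expr)
                    result.append(derived)
                    next_frontier.append(derived)
        frontier = next_frontier
    return result
```

With these three seeds, `expand(SEED, 1)` returns five formulas: the seeds plus `gross_margin` and `operating_income` rewritten in terms of `revenue` and `cost_of_goods_sold`. Zero steps returns the seeds unchanged.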

Problem Statement

Financial QA requires reading mixed tables and long text and performing exact numeric calculations. Manual annotation is expensive because annotators need domain knowledge and must label numbers and arithmetic. The paper asks: can we cheaply generate high-quality financial QA data automatically so models learn numerical reasoning over tabular and textual reports?

Main Contribution

A pipeline (FinLLMs) that (1) collects common financial formulas, (2) builds a variable-centered graph and expands formulas by graph traversals, and (3) uses GPT-3.5 to synthesize tables, supporting text, and executable DSL programs.

A DSL and dataset-generation process that produces paired examples where answers are computed from formulas, allowing automated, verifiable answers.

Empirical results showing that synthetic data (15k version) improves the execution accuracy and program accuracy of multiple retriever-generator models by roughly 2 percentage points or more on FinQA-style tests, plus ablations on traversal steps, data size, and prompt style.
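To make "executable DSL programs" concrete, here is a small interpreter sketch. It assumes a FinQA-style syntax in which each step is `op(arg, arg)` and `#k` refers to the result of step k; the exact FinLLMs DSL is not specified in this summary, so the operator set, the `run_program` name, and the syntax are assumptions. The `max_steps=4` default mirrors the four-step limit noted under Limitations.

```python
import re

# Assumed operator set; the real DSL may define more operations.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# One step looks like: name(arg,arg), where neither arg contains a comma.
STEP = re.compile(r"(\w+)\(([^,()]+),([^,()]+)\)")


def run_program(program, max_steps=4):
    """Execute a program such as 'subtract(120, 80), divide(#0, 120)'.

    Validation is minimal: text that does not look like a step is ignored.
    """
    steps = STEP.findall(program.replace(" ", ""))
    if not steps or len(steps) > max_steps:
        raise ValueError(f"expected 1 to {max_steps} steps, got {len(steps)}")
    results = []
    for op, left, right in steps:
        args = [
            results[int(arg[1:])] if arg.startswith("#") else float(arg)
            for arg in (left, right)
        ]
        results.append(OPS[op](*args))
    return results[-1]


# Gross margin from revenue 120 and cost 80: (120 - 80) / 120
margin = run_program("subtract(120, 80), divide(#0, 120)")
```

Because the answer is computed by running the program rather than written by the LLM, each generated example can be re-checked automatically.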

Key Findings

Training with FinLLMs synthetic data improves model accuracy versus training on FinQA.

Numbers: EA +3.01 and PA +3.09 points (FinQANet BERT: 53.01 vs 50.00 EA; 51.09 vs 48.00 PA)

FinLLMs 15k dataset contains 15,361 examples split ~75/10/15.

Numbers: 15,361 examples; 6,676 text-only; 8,685 table-only

Synthetic examples are mostly correct when generated from formulas.

Numbers: 97.48% correctness on a 357-sample check

In one-shot tests, LLMs score higher when prompted with FinLLMs synthetic examples than with FinQA human-labeled examples.

Numbers: GPT-4 EA: 46.87 (FinLLMs) vs 27.99 (FinQA)

Dataset complexity and composition affect gains; traversal steps and size matter.

Numbers: Performance rose with graph traversal steps 0→3 (EA 50.48→51.96 on FinQANet BERT)

Results

Execution accuracy (FinQANet BERT)

Value: 53.01% (trained on FinLLMs 15k)

Baseline: 50.00% (trained on FinQA)

Accuracy

Value: 65.39% (trained on FinLLMs 15k)

Baseline: 63.30% (trained on FinQA)

Execution accuracy (GPT-4, one-shot)

Value: 46.87% (FinLLMs prompts)

Baseline: 27.99% (FinQA prompts)

Synthetic example correctness (sampled)

Value: 97.48% correct

Baseline: 100% ideal (program-derived answers)
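The two metrics behind these figures can be sketched as follows. This assumes the usual FinQA-style reading, which this summary does not spell out: execution accuracy (EA) counts examples whose computed answer matches the gold answer, and program accuracy (PA) counts examples whose predicted program matches the gold program. The PA check below is a simplification that compares whitespace-normalized strings and does not test mathematical equivalence; function names and the tolerance are assumptions.

```python
import math


def normalize(program):
    """Strip whitespace and lowercase so trivially different strings match."""
    return "".join(program.split()).lower()


def execution_accuracy(pred_answers, gold_answers, rel_tol=1e-4):
    """Percentage of examples whose numeric answer matches the gold answer."""
    hits = sum(
        math.isclose(pred, gold, rel_tol=rel_tol)
        for pred, gold in zip(pred_answers, gold_answers)
    )
    return 100.0 * hits / len(gold_answers)


def program_accuracy(pred_programs, gold_programs):
    """Percentage of examples whose program text matches the gold program."""
    hits = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(pred_programs, gold_programs)
    )
    return 100.0 * hits / len(gold_programs)


# Toy data: two of three answers match, one of two programs matches.
ea = execution_accuracy([0.25, 10.0, 3.0], [0.25, 10.0, 4.0])
pa = program_accuracy(["add(1, 2)", "divide(4,2)"], ["add(1,2)", "subtract(4,2)"])
```

A model can reach the right number with the wrong program, so EA is typically at least as high as PA for the same predictions, which is consistent with the FinQANet BERT figures reported here.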

Who Should Care

What To Try In 7 Days

Run the pipeline to generate a small FinLLMs subset (use GPT-3.5 with provided prompts).

Train FinQANet or DyRRen with added synthetic examples and compare EA/PA to your current baseline.

Inspect a random sample of generated examples for unit errors and adjust prompts to preserve units.
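For the unit inspection step, a rough heuristic can narrow down which examples to read by hand. The sketch below flags examples where the source passage mentions a scale word such as "million" but the extracted fact does not. The field names `passage` and `extracted_fact` are hypothetical, and the word list is deliberately short; this is a screening aid under those assumptions, not a correctness check.

```python
import re

# Scale words to look for; extend as needed (e.g. "bn", "mm", "%").
SCALE_WORDS = re.compile(r"\b(thousand|million|billion)s?\b", re.IGNORECASE)


def scale_words(text):
    """Return the set of scale words found in the text, lowercased."""
    return {match.group(1).lower() for match in SCALE_WORDS.finditer(text)}


def missing_units(passage, extracted_fact):
    """Scale words present in the passage but absent from the extracted fact."""
    return scale_words(passage) - scale_words(extracted_fact)


def audit(examples):
    """Return indices of examples that appear to have dropped a scale word."""
    return [
        index
        for index, example in enumerate(examples)
        if missing_units(example["passage"], example["extracted_fact"])
    ]


examples = [
    {"passage": "Revenue was $120 million in 2022.",
     "extracted_fact": "Revenue was 120 in 2022."},
    {"passage": "Net income was $3 billion.",
     "extracted_fact": "Net income of 3 billion."},
]
flagged = audit(examples)
```

Here only the first example is flagged, because its extracted fact lost "million". Flagged items still need manual review, since a dropped scale word is harmless when every operand shares the same scale.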

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Generated DSL programs limited to at most 4 steps, so complex multi-step financial reasoning may be underrepresented.
  • Synthetic data distribution differs from human-labeled FinQA (supporting facts and complexity), which can bias models.
  • LLM extraction sometimes omits numerical units (e.g., 'millions'), causing calculation errors.
  • Privacy risks: LLM-based synthesis can leak learned data and may not meet strict regulatory requirements.

When Not To Use

  • When legal or compliance rules forbid synthetic financial data or require strict provenance.
  • When target tasks require multi-step (>4) arithmetic reasoning not supported by the DSL.
  • If you cannot audit generated numbers and units for correctness.

Failure Modes

  • Retriever selects wrong supporting facts due to crude numeric matching.
  • LLM-generated text omits or misstates units, leading to numeric errors.
  • Overfitting to synthetic distribution causes degradation on real-world report distributions.
  • Excessive graph traversal or too-large synthetic sets reduce net gains (data quality vs quantity tradeoff).

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Vicuna-33B
  • FinQANet
  • DyRRen
  • BERT-base-uncased
  • RoBERTa-large
  • LLaMA2-13B
  • TagOp / TapOp

Metrics

  • Execution accuracy (EA)
  • Program accuracy (PA)

Datasets

  • FinLLMs
  • FinQA
  • TAT-QA

Benchmarks

  • FinQA
  • TAT-QA