FinLLMs: Use formulas + LLMs to auto-generate QA datasets for financial numerical reasoning

January 19, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

FinLLMs produces verifiable synthetic QA pairs that boost model accuracy modestly; validate units and distribution before production use.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Ziqiang Yuan, Kaiyuan Wang, Shoutai Zhu, Ye Yuan, Jingya Zhou, Yanlin Zhu, Wenqi Wei


Why It Matters For Business

You can scale financial QA training data cheaply by programmatically generating tables, text, and formula-backed answers; this lowers reliance on costly expert annotation while improving model accuracy on financial numerical tasks.

Who Should Care

Summary TLDR

FinLLMs builds large-scale financial QA data automatically. The authors collect 21 core accounting formulas, build a variable graph, expand the formulas by graph traversal, and use GPT-3.5 to generate paired tables, text, and executable DSL programs (small domain-specific math programs). Training standard retriever-generator models (FinQANet, DyRRen) on the 15k-example FinLLMs dataset improves execution accuracy (EA) and program accuracy (PA) by about 2 percentage points or more compared with training on FinQA/TAT-QA. The method reduces manual labeling needs but has limits: DSL programs are simple (≤4 steps), the data distribution differs from human-labeled data, units are occasionally wrong, and LLM-based synthesis carries privacy risks.
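The formula-expansion step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three seed formulas, the variable names, and the substitution-based traversal are hypothetical and are not taken from the paper's 21-formula list or its actual algorithm.

```python
import re

# Hypothetical seed formulas (illustrative only, not the paper's list).
FORMULAS = {
    "gross_profit": "revenue - cogs",
    "operating_income": "gross_profit - operating_expenses",
    "gross_margin": "gross_profit / revenue",
}

VARIABLE = re.compile(r"[a-z_]+")


def build_graph(formulas):
    """Map each defined variable to the variables its formula uses."""
    return {target: VARIABLE.findall(expr) for target, expr in formulas.items()}


def expand_once(formulas):
    """Follow each edge target -> var; if var has its own formula, substitute it."""
    graph = build_graph(formulas)
    expanded = []
    for target, used in graph.items():
        for var in used:
            if var in formulas:
                new_expr = re.sub(rf"\b{var}\b", f"({formulas[var]})", formulas[target])
                expanded.append((target, new_expr))
    return expanded


for target, expr in expand_once(FORMULAS):
    print(f"{target} = {expr}")
```

In this sketch, one traversal step turns `operating_income` into an expression over `revenue`, `cogs`, and `operating_expenses`, which is the kind of longer derived formula the pipeline is described as producing.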

Problem Statement

Financial QA requires reading mixed tables and long text and performing exact numeric calculations. Manual annotation is expensive because annotators need domain knowledge and must label numbers and arithmetic. The paper asks: can we cheaply generate high-quality financial QA data automatically so models learn numerical reasoning over tabular and textual reports?

Main Contribution

A pipeline (FinLLMs) that (1) collects common financial formulas, (2) builds a variable-centered graph and expands formulas by graph traversals, and (3) uses GPT-3.5 to synthesize tables, supporting text, and executable DSL programs.

A DSL and dataset-generation process that produces paired examples where answers are computed from formulas, allowing automated, verifiable answers.
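To show why formula-backed programs make answers verifiable, here is a minimal sketch of a DSL executor. It assumes a FinQA-style syntax (`op(arg1, arg2)` steps, with `#n` referring to the result of step n); the actual FinLLMs DSL may differ, and the example program and numbers are made up.

```python
import re

# Assumed operator set; the real DSL may support more operations.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# One step looks like: name(arg1, arg2)
STEP = re.compile(r"(\w+)\(([^,()]+),([^,()]+)\)")


def run_program(program):
    """Execute the steps in order and return the result of the last one."""
    results = []

    def value(token):
        token = token.strip()
        if token.startswith("#"):          # reference to an earlier step
            return results[int(token[1:])]
        return float(token)                # literal number

    for op, left, right in STEP.findall(program):
        results.append(OPS[op](value(left), value(right)))
    return results[-1]


# Hypothetical growth-rate question: (120 - 100) / 100
answer = run_program("subtract(120, 100), divide(#0, 100)")
```

Because the stored answer can be recomputed from the program, a generated example whose answer disagrees with `run_program` can be discarded automatically.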

Key Findings

Training with FinLLMs synthetic data improves model accuracy versus FinQA.

Numbers: execution accuracy (EA) +3.01 and program accuracy (PA) +3.09 percentage points (FinQANet BERT: 53.01 vs 50.00 EA; 51.09 vs 48.00 PA)

Practical Use: Add FinLLMs examples to training to gain small but consistent accuracy boosts on financial numerical QA models.

Evidence Ref: Table 1; Section 4.4

FinLLMs 15k dataset contains 15,361 examples split ~75/10/15.

Numbers: 15,361 examples; 6,676 text-only; 8,685 table-only

Practical Use: You can reproduce a mid-sized training set without manual labeling using the described pipeline.

Evidence Ref: Section 4.1
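A rough sketch of producing a 75/10/15 split over 15,361 examples follows. It assumes the three parts are train/dev/test in that order and uses a seeded shuffle; the function name, the seed, and the rounding rule are illustrative, so the resulting counts may not match the paper's exact split sizes.

```python
import random


def split_indices(n_examples, train_frac=0.75, dev_frac=0.10, seed=0):
    """Shuffle example indices and cut them into train/dev/test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # seeded for repeatability
    n_train = int(n_examples * train_frac)
    n_dev = int(n_examples * dev_frac)
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]       # remainder, about 15%
    return train, dev, test


train, dev, test = split_indices(15361)
print(len(train), len(dev), len(test))
```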

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 53.01% (trained on FinLLMs 15k) | 50.00% (trained on FinQA) | +3.01% EA | FinQA test set | Table 1; Section 4.4 | Table 1
Accuracy | 65.39% (trained on FinLLMs 15k) | 63.30% (trained on FinQA) | +2.09% EA | FinQA test set | Table 1; Section 4.4 | Table 1
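The deltas in these results are differences in percentage points, not relative gains. The short sketch below recomputes them from the reported values; the helper names are illustrative.

```python
# Reported EA values: (trained on FinLLMs 15k, trained on FinQA)
rows = [(53.01, 50.00), (65.39, 63.30)]


def point_delta(synthetic, baseline):
    """Absolute difference in percentage points."""
    return round(synthetic - baseline, 2)


def relative_gain(synthetic, baseline):
    """Improvement as a percentage of the baseline score."""
    return round(100 * (synthetic - baseline) / baseline, 2)


deltas = [point_delta(s, b) for s, b in rows]
gains = [relative_gain(s, b) for s, b in rows]
print(deltas, gains)
```

Reading the delta as a relative gain would overstate or understate the effect, so it helps to report which of the two is meant when comparing against your own baseline.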

What To Try In 7 Days

Run the pipeline to generate a small FinLLMs subset (use GPT-3.5 with provided prompts).

Train FinQANet or DyRRen with added synthetic examples and compare EA/PA to your current baseline.

Inspect a random sample of generated examples for unit errors and adjust prompts to preserve units.
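For the unit inspection step, a crude heuristic can pre-filter the sample before manual review. This is only a sketch: the unit vocabulary and the regular expression are assumptions, and it will flag legitimate bare numbers such as years or counts, so treat the output as candidates to inspect rather than confirmed errors.

```python
import re

# Assumed unit markers; extend for your own data (currencies, "bps", etc.).
UNIT_AFTER = r"(?:%|\s*(?:thousand|million|billion|percent)\b)"
NUMBER = re.compile(r"(\$)?(\d+(?:,\d{3})*(?:\.\d+)?)(" + UNIT_AFTER + r")?")


def numbers_missing_units(text):
    """Return numbers that have neither a currency sign nor a unit word."""
    missing = []
    for match in NUMBER.finditer(text):
        currency, number, unit = match.groups()
        if not currency and not unit:
            missing.append(number)
    return missing


flagged = numbers_missing_units("Revenue was $120 million and costs were 45.")
print(flagged)
```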

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Generated DSL programs are limited to at most 4 steps, so complex multi-step financial reasoning may be underrepresented.

Synthetic data distribution differs from human-labeled FinQA (supporting facts and complexity), which can bias models.

When Not To Use

When legal or compliance rules forbid synthetic financial data or require strict provenance.

When target tasks require multi-step (>4) arithmetic reasoning not supported by the DSL.
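One way to check whether a target task falls outside the 4-step limit is to count the operations in its gold programs. The sketch below assumes FinQA-style program strings, where each step is an `op(...)` call; the function names and the sample programs are hypothetical.

```python
import re

MAX_STEPS = 4                               # limit stated for the generated DSL
OP_CALL = re.compile(r"\b[a-z_]+\(")        # one match per operation call


def step_count(program):
    """Count the operation calls in a program string."""
    return len(OP_CALL.findall(program))


def fits_dsl(program, max_steps=MAX_STEPS):
    """True if the program stays within the step limit."""
    return step_count(program) <= max_steps


short = "subtract(120, 100), divide(#0, 100)"
long = "add(1, 2), add(#0, 3), add(#1, 4), add(#2, 5), add(#3, 6)"
print(fits_dsl(short), fits_dsl(long))
```

If a large share of your task's programs fail this check, synthetic data capped at four steps is unlikely to cover them.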

Failure Modes

Retriever selects wrong supporting facts due to crude numeric matching.

LLM-generated text omits or misstates units, leading to numeric errors.

Core Entities

Models

GPT-3.5, GPT-4, VICUNA-33b, FinQANet, DyRRen, BERT-base-uncased, RoBERTa-large, LLaMA2-13B, TagOp/TapOp

Metrics

Accuracy

Datasets

FinLLMs, FinQA, TAT-QA

Benchmarks

FinQA, TAT-QA