FinLLMs: Use formulas + LLMs to auto-generate QA datasets for financial numerical reasoning

Overview

Decision Snapshot: Needs Validation

FinLLMs produces verifiable synthetic QA pairs that boost model accuracy modestly; validate units and distribution before production use.

Citations: 1

Evidence Strength: 0.75

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Ziqiang Yuan, Kaiyuan Wang, Shoutai Zhu, Ye Yuan, Jingya Zhou, Yanlin Zhu, Wenqi Wei


Why It Matters For Business

You can scale financial QA training data cheaply by programmatically generating tables, text, and formula-backed answers; this lowers reliance on costly expert annotation while improving model accuracy on financial numerical tasks.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder, Engineering Lead

Summary TLDR

FinLLMs builds large-scale financial QA data automatically. The authors collect 21 core accounting formulas, build a variable graph, expand formulas by graph traversal, and use GPT-3.5 to generate paired tables, text, and executable DSL programs (small domain-specific math programs). Training standard retriever-generator models (FinQANet, DyRRen) on the 15k FinLLMs dataset improves execution accuracy (EA) and program accuracy (PA) by roughly 2 to 3 percentage points relative to training on FinQA/TAT-QA. The method reduces manual labeling needs but has limits: DSL programs are simple (≤4 steps), the data distribution differs from human-labeled data, units are occasionally wrong, and LLM-based synthesis carries privacy risks.

Problem Statement

Financial QA requires reading mixed tables and long text and performing exact numeric calculations. Manual annotation is expensive because annotators need domain knowledge and must label numbers and arithmetic. The paper asks: can we cheaply generate high-quality financial QA data automatically so models learn numerical reasoning over tabular and textual reports?

Main Contribution

A pipeline (FinLLMs) that (1) collects common financial formulas, (2) builds a variable-centered graph and expands formulas by graph traversals, and (3) uses GPT-3.5 to synthesize tables, supporting text, and executable DSL programs.
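Steps (1) and (2) can be pictured with a minimal sketch. It assumes formulas are stored as an output variable plus an expression template, and that "expansion" means inlining one formula into another wherever they share a variable. The three formulas, the `FORMULAS` layout, and the helper names are illustrative assumptions, not the paper's 21 formulas or its actual implementation.

```python
# Illustrative formulas only: these are NOT the paper's 21 formulas.
# Each entry maps an output variable to (expression template, input variables).
FORMULAS = {
    "gross_profit": ("({revenue} - {cogs})", ["revenue", "cogs"]),
    "gross_margin": ("({gross_profit} / {revenue})", ["gross_profit", "revenue"]),
    "operating_income": ("({gross_profit} - {opex})", ["gross_profit", "opex"]),
}


def build_variable_graph(formulas):
    """Map each variable to the set of formula outputs that consume it."""
    graph = {}
    for output, (_, inputs) in formulas.items():
        for var in inputs:
            graph.setdefault(var, set()).add(output)
    return graph


def expand(output, formulas, depth=1):
    """Inline the definitions of intermediate variables up to `depth` levels."""
    template, inputs = formulas[output]
    bindings = {}
    for var in inputs:
        if depth > 0 and var in formulas:
            # The variable is itself defined by a formula: substitute it.
            bindings[var] = expand(var, formulas, depth - 1)
        else:
            bindings[var] = var
    return template.format(**bindings)


graph = build_variable_graph(FORMULAS)
print(sorted(graph["gross_profit"]))      # formulas reachable from gross_profit
print(expand("gross_margin", FORMULAS))   # ((revenue - cogs) / revenue)
```

Walking the graph from a shared variable such as `gross_profit` shows which formulas can be chained, and each chain yields a new, longer formula that a question can be written against.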

A DSL and dataset-generation process that produces paired examples where answers are computed from formulas, allowing automated, verifiable answers.
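What "verifiable" means here can be sketched with a tiny interpreter: the stated answer is accepted only if executing the program reproduces it. The operator names and the `#k` back-reference syntax below are assumptions modeled on FinQA-style programs; the paper's DSL may differ in both.

```python
import re

# Assumed operator set, modeled on FinQA-style programs; the paper's DSL may differ.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}
STEP_RE = re.compile(r"(\w+)\(([^()]*)\)")
MAX_STEPS = 4  # the summary reports programs of at most 4 steps


def run_program(program):
    """Execute a program such as 'subtract(120, 90), divide(#0, 120)'."""
    steps = STEP_RE.findall(program)
    if not steps or len(steps) > MAX_STEPS:
        raise ValueError("program must have between 1 and 4 steps")
    results = []
    for op, raw_args in steps:
        args = []
        for token in raw_args.split(","):
            token = token.strip()
            if token.startswith("#"):
                # '#k' refers to the result of step k (0-based).
                args.append(results[int(token[1:])])
            else:
                args.append(float(token))
        results.append(OPS[op](*args))
    return results[-1]


def verify(program, stated_answer, tol=1e-6):
    """Check that the stated answer equals the executed program result."""
    return abs(run_program(program) - stated_answer) <= tol


print(run_program("subtract(120, 90), divide(#0, 120)"))  # 0.25
```

A generated example whose answer fails `verify` can be dropped or regenerated without human review, which is the practical benefit of formula-backed answers.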

Key Findings

Training with FinLLMs synthetic data improves model accuracy versus FinQA.

Numbers: EA +3.01 and PA +3.09 percentage points (FinQANet BERT: 53.01 vs 50.00 EA; 51.09 vs 48.00 PA)

Practical Use: Add FinLLMs examples to training to gain small but consistent accuracy boosts on financial numerical QA models.

Evidence Ref: Table 1; Section 4.4

FinLLMs 15k dataset contains 15,361 examples split ~75/10/15.

Numbers: 15,361 examples; 6,676 text-only; 8,685 table-only

Practical Use: You can reproduce a mid-sized training set without manual labeling using the described pipeline.

Evidence Ref: Section 4.1
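To see roughly what a ~75/10/15 split of 15,361 examples looks like, here is a minimal sketch of a seeded shuffle-and-slice split. The function name, the seed, and the rounding rule are assumptions; the resulting counts are approximations under that rounding and are not the split sizes reported in the paper.

```python
import random


def split_dataset(examples, fractions=(0.75, 0.10, 0.15), seed=0):
    """Shuffle deterministically, then slice into train/dev/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * fractions[0])
    n_dev = int(len(items) * fractions[1])
    # The test set takes whatever remains, so no example is dropped.
    return (
        items[:n_train],
        items[n_train:n_train + n_dev],
        items[n_train + n_dev:],
    )


train, dev, test = split_dataset(range(15361))
print(len(train), len(dev), len(test))  # 11520 1536 2305
```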

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Execution accuracy (EA)	53.01% (trained on FinLLMs 15k)	50.00% (trained on FinQA)	+3.01 points	FinQA test set	Table 1; Section 4.4	Table 1
Execution accuracy (EA)	65.39% (trained on FinLLMs 15k)	63.30% (trained on FinQA)	+2.09 points	FinQA test set	Table 1; Section 4.4	Table 1

What To Try In 7 Days

Run the pipeline to generate a small FinLLMs subset (use GPT-3.5 with provided prompts).

Train FinQANet or DyRRen with added synthetic examples and compare EA/PA to your current baseline.
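For the comparison, EA and PA can be computed with a small scorer like the sketch below. It assumes each prediction and reference is a dict with `program` and `answer` keys, and it approximates PA with a whitespace- and case-insensitive string match. That string match is a simplification: official benchmark scripts may test program equivalence differently, so use them when comparing against published numbers.

```python
import math


def normalize_program(program):
    """Lowercase and strip all whitespace so formatting differences are ignored."""
    return "".join(program.lower().split())


def score(predictions, references, tol=1e-4):
    """Return EA and PA (in percent) over paired predictions and references."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    ea_hits = pa_hits = 0
    for pred, ref in zip(predictions, references):
        # EA: the executed answer matches, whatever program produced it.
        if math.isclose(pred["answer"], ref["answer"], rel_tol=tol, abs_tol=tol):
            ea_hits += 1
        # PA (simplified): the program text matches after normalization.
        if normalize_program(pred["program"]) == normalize_program(ref["program"]):
            pa_hits += 1
    n = len(references)
    return {"EA": 100 * ea_hits / n, "PA": 100 * pa_hits / n}


refs = [
    {"program": "subtract(120, 90), divide(#0, 120)", "answer": 0.25},
    {"program": "add(10, 5)", "answer": 15.0},
]
preds = [
    {"program": "multiply(0.5, 0.5)", "answer": 0.25},  # right answer, different program
    {"program": "add(10,5)", "answer": 15.0},
]
result = score(preds, refs)
print(result)  # {'EA': 100.0, 'PA': 50.0}
```

The first prediction shows why the two metrics diverge: a model can reach the right number through a program that does not match the reference.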

Inspect a random sample of generated examples for unit errors and adjust prompts to preserve units.
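The inspection step can be partly automated with a rough screen such as the sketch below, which samples examples and flags any whose text contains numbers but no recognizable unit marker. The `text` field name, the unit vocabulary, and the sample size are assumptions to adapt to your data; flagged items still need a human look, since years and counts legitimately appear without units.

```python
import random
import re

# Assumed unit vocabulary; extend it for your currencies and scales.
UNIT_RE = re.compile(
    r"(\$|%|\b(?:thousand|million|billion|percent|usd)\b)", re.IGNORECASE
)
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def flag_missing_units(examples, sample_size=50, seed=0):
    """Return sampled examples whose text has numbers but no unit markers."""
    rng = random.Random(seed)
    sample = rng.sample(examples, min(sample_size, len(examples)))
    return [
        ex for ex in sample
        if NUMBER_RE.search(ex["text"]) and not UNIT_RE.search(ex["text"])
    ]


examples = [
    {"id": "a", "text": "Revenue was $120 million in 2022."},
    {"id": "b", "text": "Cost of goods sold was 90 in 2022."},
    {"id": "c", "text": "Gross margin improved year over year."},
]
flagged = flag_missing_units(examples)
print([ex["id"] for ex in flagged])  # ['b']
```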

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Generated DSL programs are limited to at most 4 steps, so complex multi-step financial reasoning may be underrepresented.

Synthetic data distribution differs from human-labeled FinQA (supporting facts and complexity), which can bias models.

When Not To Use

When legal or compliance rules forbid synthetic financial data or require strict provenance.

When target tasks require multi-step (>4) arithmetic reasoning not supported by the DSL.

Failure Modes

Retriever selects wrong supporting facts due to crude numeric matching.

LLM-generated text omits or misstates units, leading to numeric errors.

Core Entities

Models

GPT-3.5, GPT-4, VICUNA-33b, FinQANet, DyRRen, BERT-base-uncased, RoBERTa-large, LLaMA2-13B, TagOp/TapOp

Metrics

Accuracy

Datasets

FinLLMs, FinQA, TAT-QA

Benchmarks

FinQA, TAT-QA

