Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
You can scale financial QA training data cheaply by programmatically generating tables, text, and formula-backed answers; this lowers reliance on costly expert annotation while improving model accuracy on financial numerical tasks.
Summary TLDR
FinLLMs builds large-scale financial QA data automatically. The authors collect 21 core accounting formulas, build a variable graph, expand the formulas by graph traversal, and use GPT-3.5 to generate paired tables, text, and executable DSL programs (short domain-specific programs that compute the answer). Training standard retriever-generator models (FinQANet, DyRRen) on the 15k version of the FinLLMs dataset improves execution accuracy and program accuracy by roughly 2% or more compared with FinQA/TAT-QA. The method reduces manual labeling needs but has limits: simpler DSL programs (at most 4 steps), distribution mismatches with human-labeled data, occasional unit errors, and privacy risks from LLM-based synthesis.
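To make "executable DSL program" concrete, here is a minimal sketch of an interpreter for a FinQA-style program string. The operator names, the `#n` back-reference syntax, and the function `run_program` are assumptions for illustration; the summary does not specify the exact DSL grammar FinLLMs uses. The 4-step cap mirrors the limit noted above.

```python
import re

# Assumed FinQA-style operators; the exact operator set in FinLLMs is not stated here.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# One step looks like: name(arg, arg)
STEP = re.compile(r"(\w+)\(([^()]*)\)")


def run_program(program: str, max_steps: int = 4) -> float:
    """Execute a program such as 'subtract(120, 100), divide(#0, 100)'.

    '#k' refers to the result of step k (zero-based).
    """
    steps = STEP.findall(program)
    if not steps or len(steps) > max_steps:
        raise ValueError("program must have between 1 and max_steps steps")
    results = []
    for op, raw_args in steps:
        args = []
        for tok in raw_args.split(","):
            tok = tok.strip()
            if tok.startswith("#"):
                args.append(results[int(tok[1:])])  # reuse an earlier result
            else:
                args.append(float(tok))  # literal number from table or text
        results.append(OPS[op](*args))
    return results[-1]


# Growth rate from 100 to 120: (120 - 100) / 100
growth = run_program("subtract(120, 100), divide(#0, 100)")
```

Because the answer is computed rather than typed by an annotator, any generated example can be re-executed to check that its stated answer matches its program.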
Problem Statement
Financial QA requires reading mixed tables and long text and performing exact numeric calculations. Manual annotation is expensive because annotators need domain knowledge and must label numbers and arithmetic. The paper asks: can we cheaply generate high-quality financial QA data automatically so models learn numerical reasoning over tabular and textual reports?
Main Contribution
A pipeline (FinLLMs) that (1) collects common financial formulas, (2) builds a variable-centered graph and expands formulas by graph traversals, and (3) uses GPT-3.5 to synthesize tables, supporting text, and executable DSL programs.
A DSL and dataset-generation process that produces paired examples where answers are computed from formulas, allowing automated, verifiable answers.
Empirical results showing that synthetic data (15k version) improves the execution accuracy and program accuracy of multiple retriever-generator models by roughly 2% or more on FinQA-style tests, plus ablations on traversal steps, data size, and prompt style.
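The formula-expansion step can be pictured as repeated substitution: whenever a variable in a formula is itself defined by another formula, following that edge yields a longer, derived formula. The sketch below is a simplified illustration, not the authors' implementation; the three formulas, the tuple encoding, and the names `FORMULAS`, `expand_once`, and `expand` are all assumptions.

```python
# Each entry: target variable -> (operator, left operand, right operand).
# These three textbook formulas stand in for the paper's 21 core formulas.
FORMULAS = {
    "gross_profit": ("subtract", "revenue", "cost_of_goods_sold"),
    "gross_margin": ("divide", "gross_profit", "revenue"),
    "operating_income": ("subtract", "gross_profit", "operating_expenses"),
}


def expand_once(expr):
    """Replace every defined variable in expr with its formula, one level deep."""
    if isinstance(expr, str):
        return FORMULAS.get(expr, expr)  # undefined variables stay as leaves
    op, left, right = expr
    return (op, expand_once(left), expand_once(right))


def expand(target, steps):
    """Return the base formula for target plus up to `steps` expanded variants."""
    expr = FORMULAS[target]
    variants = [expr]
    for _ in range(steps):
        nxt = expand_once(expr)
        if nxt == expr:  # nothing left to substitute
            break
        variants.append(nxt)
        expr = nxt
    return variants


for variant in expand("gross_margin", steps=2):
    print(variant)
```

Each variant can then be turned into a question, a table, and a program; the `steps` argument plays the role of the traversal-step setting that the ablations vary.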
Key Findings
Training with FinLLMs synthetic data improves execution accuracy and program accuracy compared with FinQA.
FinLLMs 15k dataset contains 15,361 examples split ~75/10/15.
Synthetic examples are mostly correct when generated from formulas.
Large language models gain more from FinLLMs synthetic data than from FinQA human data in one-shot tests.
Dataset complexity and composition affect gains; traversal steps and size matter.
Results
Accuracy
Synthetic example correctness (sampled)
Who Should Care
What To Try In 7 Days
Run the pipeline to generate a small FinLLMs subset (use GPT-3.5 with provided prompts).
Train FinQANet or DyRRen with added synthetic examples and compare EA/PA to your current baseline.
Inspect a random sample of generated examples for unit errors and adjust prompts to preserve units.
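For the baseline comparison, a small scorer along these lines may be enough to get started. It is a sketch under stated assumptions: predictions and gold labels are given as `(program, answer)` pairs, execution accuracy (EA) is a tolerance-based match on the numeric answer, and program accuracy (PA) is approximated by whitespace- and case-insensitive string equality, which is stricter than checking mathematical equivalence. The names `normalize` and `score` are hypothetical.

```python
import math


def normalize(program: str) -> str:
    """Lowercase and strip all whitespace so formatting differences do not count."""
    return "".join(program.lower().split())


def score(preds, golds, rel_tol=1e-4):
    """Return (execution_accuracy, program_accuracy).

    preds and golds are equal-length lists of (program, answer) pairs,
    aligned by position.
    """
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    pairs = list(zip(preds, golds))
    exe_hits = sum(math.isclose(p[1], g[1], rel_tol=rel_tol) for p, g in pairs)
    prog_hits = sum(normalize(p[0]) == normalize(g[0]) for p, g in pairs)
    return exe_hits / len(golds), prog_hits / len(golds)


golds = [("divide(20, 100)", 0.2), ("add(1, 2)", 3.0)]
preds = [("divide(40, 200)", 0.2), ("add(1,2)", 3.0)]
ea, pa = score(preds, golds)  # first program differs textually but gives the same answer
```

Running the same scorer on the baseline and on the model trained with synthetic data gives a like-for-like EA/PA comparison.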
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Generated DSL programs limited to at most 4 steps, so complex multi-step financial reasoning may be underrepresented.
- Synthetic data distribution differs from human-labeled FinQA (supporting facts and complexity), which can bias models.
- LLM extraction sometimes omits numerical units (e.g., 'millions'), causing calculation errors.
- Privacy risks: LLM-based synthesis can leak learned data and may not meet strict regulatory requirements.
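The unit-omission problem can be screened for with a simple heuristic before manual review. The sketch below flags examples where a table header declares a scale such as "millions" that the generated text never mentions. The function name `missing_unit`, the list of scale words, and the idea of comparing header against text are assumptions for illustration; it is a crude filter, not a full unit checker.

```python
# Scale words to look for; "millions" matches via the substring "million".
SCALE_WORDS = ("thousand", "million", "billion")


def missing_unit(table_header: str, text: str) -> bool:
    """True when the header declares a scale that the text never mentions."""
    header, body = table_header.lower(), text.lower()
    declared = [w for w in SCALE_WORDS if w in header]
    return any(w not in body for w in declared)


flagged = missing_unit("Revenue (in millions)", "Revenue was 120 in 2021.")
clean = missing_unit("Revenue (in millions)", "Revenue was $120 million in 2021.")
```

Flagged examples are candidates for inspection or for regeneration with a prompt that asks the model to restate units.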
When Not To Use
- When legal or compliance rules forbid synthetic financial data or require strict provenance.
- When target tasks require arithmetic reasoning of more than 4 steps, which the generated DSL programs do not cover.
- If you cannot audit generated numbers and units for correctness.
Failure Modes
- Retriever selects wrong supporting facts due to crude numeric matching.
- LLM-generated text omits or misstates units, leading to numeric errors.
- Overfitting to synthetic distribution causes degradation on real-world report distributions.
- Excessive graph traversal or too-large synthetic sets reduce net gains (data quality vs quantity tradeoff).
Core Entities
Models
- GPT-3.5
- GPT-4
- Vicuna-33B
- FinQANet
- DyRRen
- BERT-base-uncased
- RoBERTa-large
- LLaMA2-13B
- TagOp
Metrics
- Accuracy
Datasets
- FinLLMs
- FinQA
- TAT-QA
Benchmarks
- FinQA
- TAT-QA

