Overview
FinLLMs produces verifiable synthetic QA pairs that boost model accuracy modestly; validate units and distribution before production use.
Citations: 1
Evidence Strength: 0.75
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can scale financial QA training data cheaply by programmatically generating tables, text, and formula-backed answers; this lowers reliance on costly expert annotation while improving model accuracy on financial numerical tasks.
Summary TLDR
FinLLMs builds large-scale financial QA data automatically. The authors collect 21 core accounting formulas, build a variable graph, expand the formulas by graph traversal, and use GPT-3.5 to generate paired tables, text, and executable DSL programs (small domain-specific math programs). Training standard retriever-generator models (FinQANet, DyRRen) on the 15k FinLLMs dataset improves execution accuracy (EA) and program accuracy (PA) by roughly 2 percentage points or more compared with training on FinQA/TAT-QA. The method reduces the need for manual labeling but has limits: the generated DSL programs are simple (at most 4 steps), the data distribution differs from human-labeled data, units are occasionally wrong, and LLM-based synthesis carries privacy risks.
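The formula-expansion idea can be illustrated with a minimal sketch. It assumes each formula is stored as a target variable plus an expression over other variables, and that expansion means substituting one formula into another along an edge of the variable graph. The two seed formulas, the helper names (`build_variable_graph`, `expand`), and the string-substitution approach are illustrative assumptions, not the authors' implementation.

```python
import re
from collections import defaultdict

# Two well-known accounting identities, used only as illustrative seeds.
FORMULAS = {
    "gross_profit": "revenue - cost_of_goods_sold",
    "gross_margin": "gross_profit / revenue",
}

def variables(expr):
    """Return the variable names that appear in an expression string."""
    return set(re.findall(r"[a-z_]+", expr))

def build_variable_graph(formulas):
    """Map each variable to the formula targets that use it as an input."""
    graph = defaultdict(set)
    for target, expr in formulas.items():
        for var in variables(expr):
            graph[var].add(target)
    return graph

def expand(formulas):
    """Derive new formulas by substituting one formula into another."""
    graph = build_variable_graph(formulas)
    derived = {}
    for inner, inner_expr in formulas.items():
        # Follow each edge from an intermediate variable to a formula using it.
        for outer in graph.get(inner, ()):
            new_expr = re.sub(rf"\b{inner}\b", f"({inner_expr})", formulas[outer])
            derived[f"{outer}__via__{inner}"] = new_expr
    return derived

expanded = expand(FORMULAS)
print(expanded)
```

Here the only derived formula rewrites gross margin directly in terms of revenue and cost of goods sold, which is the kind of longer, still-verifiable formula that graph traversal is meant to produce.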
Problem Statement
Financial QA requires reading mixed tables and long text and performing exact numeric calculations. Manual annotation is expensive because annotators need domain knowledge and must label numbers and arithmetic. The paper asks: can we cheaply generate high-quality financial QA data automatically so models learn numerical reasoning over tabular and textual reports?
Main Contribution
A pipeline (FinLLMs) that (1) collects common financial formulas, (2) builds a variable-centered graph and expands formulas by graph traversals, and (3) uses GPT-3.5 to synthesize tables, supporting text, and executable DSL programs.
A DSL and dataset-generation process that produces paired examples where answers are computed from formulas, allowing automated, verifiable answers.
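To make "executable DSL programs" concrete, the sketch below runs a small program and returns its answer. It assumes a FinQA-style syntax in which each step is `op(arg, arg)` and `#n` refers to the result of step n; the four operations shown and the `run_program` helper are assumptions for illustration, and the paper's DSL may differ.

```python
import re

# Assumed subset of arithmetic operations; the actual DSL may define more.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# One step looks like: name(arg1, arg2)
STEP = re.compile(r"(\w+)\(([^()]*)\)")

def run_program(program, values=None):
    """Execute steps left to right and return the last step's result."""
    values = values or {}
    results = []
    for op, raw_args in STEP.findall(program):
        args = []
        for token in (t.strip() for t in raw_args.split(",")):
            if token.startswith("#"):
                # Reference to the result of an earlier step.
                args.append(results[int(token[1:])])
            elif token in values:
                # Named variable looked up in the supplied table values.
                args.append(values[token])
            else:
                # Otherwise treat the token as a numeric literal.
                args.append(float(token))
        results.append(OPS[op](*args))
    return results[-1]

growth = run_program("subtract(120.0, 100.0), divide(#0, 100.0)")
margin = run_program(
    "subtract(revenue, cogs), divide(#0, revenue)",
    {"revenue": 200.0, "cogs": 150.0},
)
```

Because the answer is computed by executing the program rather than written by the LLM, a generated example can be checked automatically by re-running its program against the table values.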
Key Findings
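Training on the FinLLMs 15k synthetic data raises execution accuracy (EA) on the FinQA test set by 2.09 to 3.01 percentage points over training on FinQA (Table 1).
The FinLLMs 15k dataset contains 15,361 examples, split roughly 75/10/15.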
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Execution accuracy (EA) | 53.01% (trained on FinLLMs 15k) | 50.00% (trained on FinQA) | +3.01 pp | FinQA test set | Table 1; Section 4.4 | Table 1 |
| Execution accuracy (EA) | 65.39% (trained on FinLLMs 15k) | 63.30% (trained on FinQA) | +2.09 pp | FinQA test set | Table 1; Section 4.4 | Table 1 |
What To Try In 7 Days
Run the pipeline to generate a small FinLLMs subset (use GPT-3.5 with provided prompts).
Train FinQANet or DyRRen with added synthetic examples and compare EA/PA to your current baseline.
Inspect a random sample of generated examples for unit errors and adjust prompts to preserve units.
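For the EA/PA comparison in the steps above, a minimal scoring sketch might look like the following. It assumes EA counts a prediction as correct when its executed answer matches the gold answer within a tolerance, and PA counts it as correct when the predicted program string equals the gold program after whitespace is removed. The record field names and the toy records are hypothetical, and a real evaluation script may also accept mathematically equivalent programs.

```python
import math

def normalize(program):
    """Remove all whitespace so formatting differences do not count as errors."""
    return "".join(program.split())

def score(records, rel_tol=1e-4):
    """Return execution accuracy (EA) and program accuracy (PA) in percent."""
    ea_hits = 0
    pa_hits = 0
    for r in records:
        if math.isclose(r["pred_answer"], r["gold_answer"], rel_tol=rel_tol):
            ea_hits += 1
        if normalize(r["pred_program"]) == normalize(r["gold_program"]):
            pa_hits += 1
    n = len(records)
    return {"EA": 100.0 * ea_hits / n, "PA": 100.0 * pa_hits / n}

# Toy records, made up for illustration only.
records = [
    {"pred_program": "divide(20, 100)", "gold_program": "divide(20,100)",
     "pred_answer": 0.2, "gold_answer": 0.2},
    {"pred_program": "divide(100, 20)", "gold_program": "divide(20, 100)",
     "pred_answer": 5.0, "gold_answer": 0.2},
    {"pred_program": "multiply(20, 0.01)", "gold_program": "divide(20, 100)",
     "pred_answer": 0.2, "gold_answer": 0.2},
    {"pred_program": "add(1, 2)", "gold_program": "add(1, 2)",
     "pred_answer": 3.0, "gold_answer": 3.0},
]

scores = score(records)
print(scores)
```

The third toy record shows why the two metrics can diverge: the answer is right, but the program differs from the gold program, so it counts toward EA and not PA under this strict string match.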
Risks & Boundaries
Limitations
Generated DSL programs are limited to at most 4 steps, so complex multi-step financial reasoning may be underrepresented.
The synthetic data distribution differs from human-labeled FinQA (in supporting facts and complexity), which can bias models.
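One way to gauge how much the 4-step limit matters for a given target task is to count program steps in your own gold data. The sketch below assumes FinQA-style programs in which each step is `op(arg, arg)`; the helper names and the toy programs are illustrative.

```python
import re
from collections import Counter

MAX_STEPS = 4  # step limit stated for the generated programs

# One step looks like: name(arg1, arg2)
STEP = re.compile(r"\w+\([^()]*\)")

def step_count(program):
    """Count the number of operation steps in a program string."""
    return len(STEP.findall(program))

def step_histogram(programs):
    """Return how many programs have each step count."""
    return Counter(step_count(p) for p in programs)

def share_over_limit(programs, limit=MAX_STEPS):
    """Return the fraction of programs with more steps than the limit."""
    over = sum(1 for p in programs if step_count(p) > limit)
    return over / len(programs)

# Toy gold programs, made up for illustration only.
programs = [
    "divide(20, 100)",
    "subtract(120, 100), divide(#0, 100)",
    "add(1, 2), add(#0, 3), add(#1, 4), add(#2, 5), divide(#3, 5)",
]

histogram = step_histogram(programs)
over_share = share_over_limit(programs)
```

Comparing this histogram between the synthetic set and a human-labeled set is also a quick check on the complexity mismatch noted above.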
When Not To Use
When legal or compliance rules forbid synthetic financial data or require strict provenance.
When target tasks require multi-step (>4) arithmetic reasoning not supported by the DSL.
Failure Modes
Retriever selects wrong supporting facts due to crude numeric matching.
LLM-generated text omits or misstates units, leading to numeric errors.
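A rough screen for the unit failure mode is to flag numbers in generated text that carry no unit marker. The sketch below is a heuristic under stated assumptions: the unit vocabulary (`$`, `%`, percent, thousand, million, billion) and the rule that skips four-digit years are choices made for illustration, and the screen will not catch misstated units, only missing ones.

```python
import re

# A number, optionally prefixed by "$" and optionally followed by a unit word.
NUMBER = re.compile(
    r"\$?\d+(?:,\d{3})*(?:\.\d+)?(?:\s*(%|percent|thousand|million|billion))?",
    re.IGNORECASE,
)

# Heuristic: bare four-digit values such as 2022 are treated as years.
YEAR = re.compile(r"(?:19|20)\d{2}")

def numbers_missing_units(text):
    """Return numeric tokens that have neither a currency sign nor a unit."""
    missing = []
    for match in NUMBER.finditer(text):
        token = match.group(0)
        has_currency = token.startswith("$")
        has_unit = match.group(1) is not None
        if has_currency or has_unit or YEAR.fullmatch(token):
            continue
        missing.append(token)
    return missing

sample = "Revenue rose to $4.2 million in 2022, costs were 3.1, and margin was 12%."
flagged = numbers_missing_units(sample)
print(flagged)
```

Flagged examples still need a manual look, since some bare numbers (counts, ratios) are legitimately unitless.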

