Three practical tools for making LLMs more factual in finance: a benchmark, an injection framework, and a retrieval QA system

Overview

Decision Snapshot: Needs Validation

The paper provides working datasets, code pointers, and concrete experiments showing reproducible gains. Some results are scored by GPT-4 acting as an automated judge, so the usual judge-bias caveats apply.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Cehao Yang, Chengjin Xu, Yiyan Qi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with compact instruction fine-tuning, giving better factual answers and sourceable outputs without full model re-pretraining.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

This work builds three practical artifacts for financial LLMs: (1) IDEA-FinBench, a bilingual benchmark built from CPA and CFA exam questions (4,617 items) to measure finance knowledge; (2) IDEA-FinKER, a two-paradigm knowledge injection framework (soft: retrieval-based few-shot; hard: instruction fine-tuning), together with a FinCorpus of about 300k Chinese exam questions; (3) IDEA-FinQA, an LLM-driven retrieval QA system (agents plus embedding and text search) with a new FinFact fact-checking dataset. In the experiments, GPT-4 leads on the benchmark, knowledge injection helps weaker models (up to about 9 percentage points on CPA), and the retrieval QA system reaches a 70% factual win rate on FinFact as scored by a GPT-4 judge.

Problem Statement

Finance needs LLMs that are accurate, up-to-date, and expert-level. Off-the-shelf LLMs hallucinate, training updates are costly, and existing financial benchmarks are limited. The paper asks how to measure finance knowledge, how to inject factual finance knowledge quickly and cheaply, and how to build a retrieval-backed QA system that cites sources.

Main Contribution

IDEA-FinBench: bilingual (Chinese/English) benchmark from CPA and CFA exams (4,617 questions) covering 16 subjects and four question formats, with a modular evaluation suite.

IDEA-FinKER: a two-pronged knowledge-injection framework — soft-injecting (retrieval-based few-shot using FinCorpus) and hard-injecting (LoRA instruction fine-tuning) — plus a cleaned FinCorpus (~300k unique Chinese exam questions).
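The soft-injecting idea (retrieve similar solved exam questions and prepend them as few-shot demonstrations) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ExamItem` type, the token-overlap similarity (a stand-in for whatever retriever the authors use), and the prompt template are hypothetical and not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class ExamItem:
    """A solved exam question (hypothetical structure, not the FinCorpus schema)."""
    question: str
    answer: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between whitespace-token sets; a stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_few_shot_prompt(query: str, corpus: list[ExamItem], k: int = 3) -> str:
    """Pick the k most similar solved items and prepend them as demonstrations."""
    ranked = sorted(
        corpus,
        key=lambda item: token_overlap(query, item.question),
        reverse=True,
    )
    parts = [f"Question: {s.question}\nAnswer: {s.answer}" for s in ranked[:k]]
    # The unanswered query goes last so the model completes the final "Answer:".
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)


corpus = [
    ExamItem("Which statement reports revenue and expenses? A. Balance sheet B. Income statement", "B"),
    ExamItem("Which ratio measures short-term liquidity? A. Current ratio B. Debt to equity", "A"),
    ExamItem("What does depreciation allocate? A. Asset cost over time B. Revenue over time", "A"),
]
prompt = build_few_shot_prompt(
    "Which statement reports assets and liabilities? A. Balance sheet B. Income statement",
    corpus,
    k=2,
)
```

The resulting `prompt` string would be sent to the model unchanged; no weights are updated, which is what distinguishes this path from the LoRA-based hard-injecting path.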

Key Findings

GPT-4 leads on IDEA-FinBench across subjects.

Numbers: CFA-L1 accuracy 84.26%; CPA-SA 62.38%

Practical Use: Use GPT-4 (API) when accuracy on English finance exams matters; expect lower performance on Chinese CPA.

Evidence Ref: Table 3.3

Domain knowledge injection (IDEA-FinKER) raises accuracy for weaker models.

Numbers: Baichuan2-7B-Chat CPA-SA +8.35 pts; Qwen-7B-Chat +9.05 pts; Yi-6B-Chat +7.29 pts

Practical Use: Apply combined retrieval few-shot plus targeted LoRA fine-tuning to get multi-point gains without full re-pretraining.

Evidence Ref: Table 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	CPA-SA 62.38%; CPA-MA 45.27%; CFA-L1 84.26%; CFA-L2 60.84%	Random (CPA-SA 25%, CPA-MA 10%, CFA 33.33%)	—	IDEA-FinBench test set	Table 3.3	Table 3.3
Accuracy	CPA-SA 78.71%; CPA-MA 62.35%; CFA-L1 75.49%; CFA-L2 53.87%	Various general models in Table 3.3	—	IDEA-FinBench test set	Table 3.3	Table 3.3
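To make the accuracy metric and the random baselines concrete, here is a minimal scoring sketch. It assumes exact-match scoring on answer labels. The 25% and 33.33% baselines are consistent with uniform guessing over four and three options respectively; those option counts are inferred from the baselines rather than stated in this summary. The 10% CPA-MA baseline is not derived here, and the function names are hypothetical.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Percentage of items whose predicted label exactly matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def random_single_answer_baseline(n_options: int) -> float:
    """Expected accuracy (%) of uniform guessing on a single-answer question."""
    return 100.0 / n_options


# Toy data, not taken from the benchmark.
preds = ["A", "C", "B", "D"]
gold = ["A", "B", "B", "D"]
score = accuracy(preds, gold)
margin = score - random_single_answer_baseline(4)  # points above chance
```

Reporting the margin over chance alongside raw accuracy makes scores from tasks with different option counts easier to compare.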

What To Try In 7 Days

Run IDEA-FinBench (public code) on your LLM to measure finance knowledge gaps.

Add a retrieval layer (embedding DB + top-5 cosine) to provide context for queries.

Apply a small LoRA instruction-tuning pass on a few finance instructions and validate on CPA-style QA.
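The retrieval step suggested above (top-5 by cosine similarity) can be sketched without any vector database. This is an illustrative in-memory version, not the ChromaDB API: the `store` layout, the toy three-dimensional vectors, and the function names are assumptions, and a real setup would obtain the vectors from an embedding model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 5):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store: (passage, embedding). Real embeddings would come from a model.
store = [
    ("Revenue recognition under accrual accounting", [0.9, 0.1, 0.0]),
    ("Bond duration and interest-rate risk", [0.1, 0.9, 0.1]),
    ("Deferred tax liabilities", [0.8, 0.2, 0.1]),
]
hits = top_k([1.0, 0.0, 0.0], store, k=2)
```

The retrieved passages would then be placed in the prompt as context. A brute-force scan like this is only reasonable for small stores; an embedding database takes over the same ranking at scale.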

Agent Features

Memory

retrieval memory (external FinCorpus/paragraph store)

Planning

pipeline with staged agents for rewrite → retrieve → extract → generate
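The staged flow can be pictured as a list of functions applied in order to a shared state. The sketch below is a deliberately simple stand-in: each stage is a hypothetical rule-based stub, whereas the summary describes LLM-based agents, and the intention detector is omitted.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def rewrite(state: dict) -> dict:
    # Hypothetical rewriter: normalise whitespace and case.
    state["query"] = " ".join(state["question"].lower().split())
    return state


def retrieve(state: dict) -> dict:
    # Hypothetical retriever: keep passages sharing at least one word with the query.
    query_tokens = tokens(state["query"])
    state["passages"] = [p for p in state["corpus"] if query_tokens & tokens(p)]
    return state


def extract(state: dict) -> dict:
    # Hypothetical extractor: keep only the first retrieved passage as evidence.
    state["evidence"] = state["passages"][:1]
    return state


def generate(state: dict) -> dict:
    # Hypothetical generator: answer only when evidence exists, and quote it.
    if state["evidence"]:
        state["answer"] = f"Based on: {state['evidence'][0]}"
    else:
        state["answer"] = "No supporting source found."
    return state


STAGES = [rewrite, retrieve, extract, generate]


def run_pipeline(question: str, corpus: list[str]) -> dict:
    """Pass a shared state dict through each stage in order."""
    state = {"question": question, "corpus": corpus}
    for stage in STAGES:
        state = stage(state)
    return state


docs = [
    "EBITDA excludes interest, taxes, depreciation and amortization.",
    "Bonds pay coupons.",
]
result = run_pipeline("  What is  EBITDA? ", docs)
```

Keeping each stage behind the same `dict -> dict` signature means a stub can be swapped for an LLM call without touching the rest of the pipeline, and the explicit fallback in `generate` avoids answering without a source.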

Tool Use

embedding DB (ChromaDB), text-index search, web crawlers for news and reports

Frameworks

LLaMA-Factory, LangChain, LlamaIndex

Is Agentic

Yes

Architectures

LLM-based agents (query rewriter, intention detector, extractor, generator)

Collaboration

multi-agent pipeline (sequential specialized agents)

Optimization Features

Infra Optimization

LoRA

Model Optimization

LoRA

System Optimization

flash-attention used in fine-tuning; bfloat16 training

Training Optimization

LoRA

Inference Optimization

not explicitly described beyond standard LLM inference

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IDEA-FinAI/IDEAFinBench

Data URLs

https://github.com/IDEA-FinAI/IDEAFinBench

Risks & Boundaries

Limitations

FinCorpus is Chinese-heavy; gains are stronger on Chinese CPA tasks than on English CFA.

Automatic judging (GPT-4) may introduce bias in factuality evaluation.

When Not To Use

Do not rely on this pipeline alone for high-stakes investment or trading decisions without human oversight.

Avoid using retrieval-only context when retrieval sources cannot be verified or are behind paywalls.

Failure Modes

Bad or missing retrieval leads to confident but wrong answers (hallucination despite RAG).

Domain fine-tuning can damage a base model's general abilities if instruction data is misaligned.

Core Entities

Models

GPT-4, ChatGPT, LLaMA-2-7B, LLaMA-2-13B, Chinese-Alpaca-2-7B, Chinese-Alpaca-2-13B, ChatGLM3-6B, Baichuan2-7B, Baichuan2-13B, Qwen-7B, Qwen-14B, Yi-6B, Yi-34B, DISC-FinLLM, Tongyi-Finance-14B, IDEA-FinLLM

Metrics

Accuracy, winning rate (judge-based)
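A judge-based winning rate can be computed from per-question verdicts as in the sketch below. The verdict labels and the tie policy are assumptions: this summary does not say how ties are handled or which system the QA pipeline was compared against, so ties are counted here as non-wins.

```python
from collections import Counter


def win_rate(verdicts: list[str], system: str = "system") -> float:
    """Percentage of comparisons the judge awarded to `system` (ties count as non-wins)."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    return 100.0 * counts[system] / len(verdicts)


# Toy verdicts, not results from the paper.
verdicts = ["system"] * 7 + ["baseline"] * 2 + ["tie"]
rate = win_rate(verdicts)
```

Because the verdicts come from an LLM judge, the rate inherits any bias the judge has; spot-checking a sample by hand is a reasonable sanity check.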

Datasets

IDEA-FinBench, FinCorpus, FinFact, CPA exam questions, CFA Level 1 & 2 questions

Benchmarks

IDEA-FinBench, FinFact

Context Entities

Models

GPT-3.5 / gpt-4-turbo-preview (used as judge/for generation)

Metrics

precision/recall not explicitly reported

Datasets

FinEval, C-Eval (reference)
