Three practical tools for making LLMs more factual in finance: a benchmark, an injection framework, and a retrieval QA system

June 29, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides working datasets, code pointers, and concrete experiments showing reproducible gains. Some results are scored by GPT-4 acting as an automated judge, so they carry the usual judge-bias caveats.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Cehao Yang, Chengjin Xu, Yiyan Qi

Why It Matters For Business

You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with compact instruction fine-tuning, giving better factual answers and sourceable outputs without full model re-pretraining.

Who Should Care

Summary TLDR

This work builds three practical artifacts for financial LLMs: (1) IDEA-FinBench, a bilingual benchmark built from CPA and CFA exam questions (about 4.6k items) to measure finance knowledge; (2) IDEA-FinKER, a two-paradigm knowledge-injection framework (soft: retrieval-based few-shot; hard: instruction fine-tuning), together with a 300k-item Chinese FinCorpus; (3) IDEA-FinQA, an LLM-driven retrieval QA system (agents plus embedding and text search) with a new FinFact fact-checking dataset. Experiments show that GPT-4 leads on the benchmark, that domain injection helps weaker models (up to about 9 points absolute on CPA), and that the retrieval QA system reaches a 70% factual win rate on FinFact as judged by GPT-4.

Problem Statement

Finance needs LLMs that are accurate, up-to-date, and expert-level. Off-the-shelf LLMs hallucinate, training updates are costly, and existing financial benchmarks are limited. The paper asks: how to measure finance knowledge, inject factual finance knowledge fast and cheaply, and build a retrieval-backed QA system that cites sources.

Main Contribution

IDEA-FinBench: bilingual (Chinese/English) benchmark from CPA and CFA exams (4,617 questions) covering 16 subjects and four question formats, with a modular evaluation suite.

IDEA-FinKER: a two-pronged knowledge-injection framework — soft-injecting (retrieval-based few-shot using FinCorpus) and hard-injecting (LoRA instruction fine-tuning) — plus a cleaned FinCorpus (~300k unique Chinese exam questions).
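
As a rough illustration of the soft-injecting idea (retrieval-based few-shot), the sketch below assembles a prompt from already-retrieved question/answer pairs. The `Exemplar` class, the `build_few_shot_prompt` function, the prompt layout, and the sample questions are all hypothetical; the paper's actual prompt template is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One retrieved question/answer pair (hypothetical structure)."""
    question: str
    answer: str


def build_few_shot_prompt(exemplars, query, instruction="Answer the exam question."):
    """Assemble instruction, retrieved Q/A pairs, then the unanswered query."""
    parts = [instruction, ""]
    for i, ex in enumerate(exemplars, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Question: {ex.question}")
        parts.append(f"Answer: {ex.answer}")
        parts.append("")
    parts.append(f"Question: {query}")
    parts.append("Answer:")
    return "\n".join(parts)


# Invented exemplars standing in for items retrieved from a corpus such as FinCorpus.
retrieved = [
    Exemplar("Which statement reports revenue and expenses? A. Balance sheet B. Income statement", "B"),
    Exemplar("Which item is a current asset? A. Inventory B. Goodwill", "A"),
]
prompt = build_few_shot_prompt(
    retrieved, "Which item is a liability? A. Accounts payable B. Cash"
)
print(prompt)
```

The model sees worked examples from the same domain before the target question, which is what lets this paradigm add knowledge without changing any weights.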

Key Findings

GPT-4 leads on IDEA-FinBench across subjects.

Numbers: CFA-L1 accuracy 84.26%; CPA-SA 62.38%

Practical Use: Use GPT-4 (API) when accuracy on English finance exams matters; expect lower performance on Chinese CPA.

Evidence Ref: Table 3.3

Domain knowledge injection (IDEA-FinKER) raises accuracy for weaker models.

Numbers: Baichuan2-7B-Chat CPA-SA +8.35 pts; Qwen-7B-Chat +9.05 pts; Yi-6B-Chat +7.29 pts

Practical Use: Apply combined retrieval few-shot plus targeted LoRA fine-tuning to get multi-point gains without full re-pretraining.

Evidence Ref: Table 4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | CPA-SA 62.38%; CPA-MA 45.27%; CFA-L1 84.26%; CFA-L2 60.84% | Random (CPA-SA 25%, CPA-MA 10%, CFA 33.33%) | (not stated) | IDEA-FinBench test set | Table 3.3 | Table 3.3
Accuracy | CPA-SA 78.71%; CPA-MA 62.35%; CFA-L1 75.49%; CFA-L2 53.87% | Various general models in Table 3.3 | (not stated) | IDEA-FinBench test set | Table 3.3 | Table 3.3
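
The first results row gives no explicit delta, but one can be derived from its own numbers. The short sketch below subtracts the listed random-guess baselines from the reported accuracies. Applying the single "CFA 33.33%" baseline to both CFA-L1 and CFA-L2 is an assumption, and the `delta_points` helper is illustrative only.

```python
# Accuracy values (%) copied from the first results row.
reported = {"CPA-SA": 62.38, "CPA-MA": 45.27, "CFA-L1": 84.26, "CFA-L2": 60.84}

# Random-guess baselines (%) from the same row; the single CFA figure is
# assumed to apply to both CFA levels.
random_baseline = {"CPA-SA": 25.0, "CPA-MA": 10.0, "CFA-L1": 33.33, "CFA-L2": 33.33}


def delta_points(scores, baseline):
    """Return the per-split gain over baseline in percentage points."""
    return {split: round(scores[split] - baseline[split], 2) for split in scores}


deltas = delta_points(reported, random_baseline)
for split, gain in deltas.items():
    print(f"{split}: +{gain:.2f} pts over random")
```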

What To Try In 7 Days

Run IDEA-FinBench (public code) on your LLM to measure finance knowledge gaps.

Add a retrieval layer (embedding DB + top-5 cosine) to provide context for queries.

Apply a small LoRA instruction-tuning pass on a few finance instructions and validate on CPA-style QA.
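
The retrieval step suggested above (top-5 by cosine similarity) can be sketched in plain Python. This is a minimal stand-in, not the ChromaDB API: the `cosine` and `top_k` helpers, the document ids, and the toy 3-dimensional vectors are all invented for illustration, and a real system would use embeddings from a model and a vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=5):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real paragraph embeddings.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
    "doc_e": [0.5, 0.5, 0.0],
    "doc_f": [-1.0, 0.0, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], store, k=5)
for doc_id, score in hits:
    print(f"{doc_id}: {score:.3f}")
```

The five returned passages would then be placed in the prompt as context for the query.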

Agent Features

Memory
retrieval memory (external FinCorpus/paragraph store)
Planning
pipeline with staged agents for rewrite → retrieve → extract → generate
Tool Use
embedding DB (ChromaDB), text-index search, web crawlers for news and reports
Frameworks
LLaMA-Factory, LangChain, LlamaIndex
Is Agentic

Yes

Architectures
LLM-based agents (query rewriter, intention detector, extractor, generator)
Collaboration
multi-agent pipeline (sequential specialized agents)
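
The staged design described here (rewrite → retrieve → extract → generate) can be sketched as a chain of functions that pass a shared state along. Every stage below is a hypothetical stub: in the described system each stage is an LLM-based agent, whereas this sketch uses simple string operations and keyword overlap so that it runs without any model.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], State]


def rewrite(state: State) -> State:
    """Stub query rewriter: normalise the raw query."""
    query = str(state["query"]).strip().rstrip("?").lower()
    return {**state, "rewritten": query}


def retrieve(state: State) -> State:
    """Stub retriever: keep passages sharing at least one word with the query."""
    terms = set(str(state["rewritten"]).split())
    passages = [p for p in state["corpus"] if terms & set(p.lower().split())]
    return {**state, "passages": passages}


def extract(state: State) -> State:
    """Stub extractor: keep at most two passages as evidence."""
    return {**state, "evidence": list(state["passages"])[:2]}


def generate(state: State) -> State:
    """Stub generator: produce an answer that cites its evidence."""
    evidence = state["evidence"]
    if not evidence:
        answer = "No supporting passage found."
    else:
        answer = "Based on: " + " | ".join(evidence)
    return {**state, "answer": answer}


def run_pipeline(state: State, stages: List[Stage]) -> State:
    """Run the stages in order, feeding each one the previous stage's output."""
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(
    {
        "query": "How many levels does the CFA exam have?",
        "corpus": ["The CFA exam has three levels.", "LoRA is a fine-tuning method."],
    },
    [rewrite, retrieve, extract, generate],
)
print(result["answer"])
```

Keeping each stage as a separate function with the same signature makes it easy to swap a stub for a real agent without touching the rest of the chain.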

Optimization Features

Infra Optimization
LoRA
Model Optimization
LoRA
System Optimization
flash-attention used in fine-tuning; bfloat16 training
Training Optimization
LoRA
Inference Optimization
not explicitly described beyond standard LLM inference
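
To give a sense of why LoRA keeps fine-tuning cheap, the sketch below counts trainable parameters for one adapted weight matrix. LoRA freezes a d_out × d_in weight and trains two low-rank factors, adding rank × (d_in + d_out) parameters. The hidden size of 4096 and rank of 8 are assumed values for illustration, not settings reported in this summary.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in, d_out):
    """Parameters in the full (frozen) weight matrix."""
    return d_in * d_out


hidden = 4096  # assumed hidden size, for illustration only
rank = 8       # assumed LoRA rank, for illustration only

adapter = lora_trainable_params(hidden, hidden, rank)
dense = dense_params(hidden, hidden)
print(f"{adapter} adapter params vs {dense} dense params ({adapter / dense:.2%})")
```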

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

FinCorpus is Chinese-heavy; gains are stronger on Chinese CPA tasks than on English CFA.

Automatic judging (GPT-4) may introduce bias in factuality evaluation.
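
For readers unfamiliar with judge-based win rates, the sketch below shows how such a figure is tallied from per-question verdicts. The verdict labels, the `win_rate` helper, and the toy verdict list are invented for illustration and are not the paper's data. Counting ties as non-wins is an assumption, since this summary does not say how the paper treats ties.

```python
from collections import Counter


def win_rate(verdicts, candidate="system"):
    """Fraction of judged items where the candidate was preferred.

    Ties and losses both count against the candidate (an assumption).
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    return counts[candidate] / len(verdicts)


# Toy verdicts, one per question, as a judge model might return them.
verdicts = ["system"] * 7 + ["baseline"] * 2 + ["tie"]
rate = win_rate(verdicts)
print(f"win rate: {rate:.0%}")
```

Because every verdict comes from one judge model, any systematic preference of that judge shifts the whole rate, which is the bias concern raised above.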

When Not To Use

Do not rely on this pipeline alone for high-stakes investment or trading decisions without human oversight.

Avoid using retrieval-only context when retrieval sources cannot be verified or are behind paywalls.

Failure Modes

Bad or missing retrieval leads to confident but wrong answers (hallucination despite RAG).

Domain fine-tuning can damage a base model's general abilities if instruction data is misaligned.

Core Entities

Models

GPT-4, ChatGPT, LLaMA-2-7B, LLaMA-2-13B, Chinese-Alpaca-2-7B, Chinese-Alpaca-2-13B, ChatGLM3-6B, Baichuan2-7B, Baichuan2-13B, Qwen-7B, Qwen-14B, Yi-6B, Yi-34B, DISC-FinLLM, Tongyi-Finance-14B, IDEA-FinLLM

Metrics

Accuracy, winning rate (judge-based)

Datasets

IDEA-FinBench, FinCorpus, FinFact, CPA exam questions, CFA Level 1 & 2 questions

Benchmarks

IDEA-FinBench, FinFact

Context Entities

Models

GPT-3.5 / gpt-4-turbo-preview (used as judge/for generation)

Metrics

precision/recall not explicitly reported

Datasets

FinEval, C-Eval (reference)