Three practical tools for making LLMs more factual in finance: a benchmark, an injection framework, and a retrieval QA system

June 29, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

5

Authors

Cehao Yang, Chengjin Xu, Yiyan Qi

Links

Abstract / PDF

Why It Matters For Business

You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with compact instruction fine-tuning, giving better factual answers and sourceable outputs without full model re-pretraining.

Summary TLDR

This work builds three practical artifacts for financial LLMs: (1) IDEA-FinBench, a bilingual benchmark of 4,617 CPA and CFA exam questions for measuring finance knowledge; (2) IDEA-FinKER, a two-paradigm knowledge-injection framework (soft: retrieval-based few-shot prompting; hard: instruction fine-tuning), together with a 300k-item Chinese FinCorpus; (3) IDEA-FinQA, an LLM-driven retrieval QA system (agents plus embedding and text search), together with a new FinFact fact-checking dataset. Experiments show that GPT-4 leads on the benchmark, that domain injection helps weaker models (up to about 9 percentage points on CPA-SA), and that the retrieval QA system reaches a 70% factual win rate on FinFact as judged by GPT-4.

Problem Statement

Finance needs LLMs that are accurate, up-to-date, and expert-level. Off-the-shelf LLMs hallucinate, training updates are costly, and existing financial benchmarks are limited. The paper asks: how to measure finance knowledge, inject factual finance knowledge fast and cheaply, and build a retrieval-backed QA system that cites sources.

Main Contribution

IDEA-FinBench: bilingual (Chinese/English) benchmark from CPA and CFA exams (4,617 questions) covering 16 subjects and four question formats, with a modular evaluation suite.

IDEA-FinKER: a two-pronged knowledge-injection framework — soft-injecting (retrieval-based few-shot using FinCorpus) and hard-injecting (LoRA instruction fine-tuning) — plus a cleaned FinCorpus (~300k unique Chinese exam questions).

IDEA-FinQA and FinFact: a retrieval-augmented, agent-driven QA pipeline plus FinFact, a 1.5k-item Chinese financial fact-check dataset; system returns answers with citations and shows strong factuality under automatic judging.
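The soft-injection idea in IDEA-FinKER (retrieved exam questions placed in the prompt as few-shot examples) can be illustrated with a small prompt builder. This is a minimal sketch, not the paper's implementation: the prompt template, the `format_item` and `build_few_shot_prompt` helpers, and the sample questions are all hypothetical, and the exemplars are assumed to have been retrieved already.

```python
def format_item(question, options):
    """Render one multiple-choice item as plain text."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in options]
    return "\n".join(lines)


def build_few_shot_prompt(question, options, exemplars):
    """Prepend solved exemplars, then leave the target answer open."""
    blocks = [
        format_item(ex["question"], ex["options"]) + f"\nAnswer: {ex['answer']}"
        for ex in exemplars
    ]
    blocks.append(format_item(question, options) + "\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical exemplar, standing in for an item retrieved from a corpus.
exemplars = [
    {
        "question": "Which statement reports revenue and expenses?",
        "options": [("A", "Balance sheet"), ("B", "Income statement")],
        "answer": "B",
    },
]
prompt = build_few_shot_prompt(
    "Which statement reports assets and liabilities?",
    [("A", "Balance sheet"), ("B", "Income statement")],
    exemplars,
)
print(prompt)
```

The model sees worked examples from the same domain before the target question, which is the whole mechanism: no weights change, so the exemplar store can be updated without retraining.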

Key Findings

GPT-4 leads on IDEA-FinBench across subjects.

Numbers: CFA-L1 accuracy 84.26%; CPA-SA accuracy 62.38%

Domain knowledge injection (IDEA-FinKER) raises accuracy for weaker models.

Numbers: Baichuan2-7B-Chat CPA-SA +8.35 pts; Qwen-7B-Chat +9.05 pts; Yi-6B-Chat +7.29 pts

Combined soft + hard injection works best.

Numbers: Best results are reported when both paradigms are combined, for each base model

IDEA-FinQA achieves strong factual answers judged by GPT-4.

Numbers: Factual winning rate 70% vs other models on FinFact

Vertical/fine-tuned financial LLMs don't always outperform general models.

Numbers: Some domain-trained financial models underperform general models on the benchmark (discussed qualitatively)

Results

Accuracy (GPT-4)

Value: CPA-SA 62.38%; CPA-MA 45.27%; CFA-L1 84.26%; CFA-L2 60.84%

Baseline: Random guessing (CPA-SA 25%, CPA-MA 10%, CFA 33.33%)

Accuracy

Value: CPA-SA 78.71%; CPA-MA 62.35%; CFA-L1 75.49%; CFA-L2 53.87%

Baseline: Various general models in Table 3.3

IDEA-FinKER improvement (CPA-SA)

Value: Baichuan2-7B-Chat +8.35 pts; Qwen-7B-Chat +9.05 pts; Yi-6B-Chat +7.29 pts

Baseline: vanilla models

IDEA-FinQA factual winning rate

Value: 70% factual win rate (judge-based)

Baseline: other evaluated LLMs
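A judge-based winning rate is a simple ratio over pairwise verdicts. The sketch below shows one way to compute it, under assumptions the summary does not confirm: each comparison yields a label of "A" (the system under test wins), "B" (the other model wins), or "tie", and ties count in the denominator. The `win_rate` helper and the sample verdicts are hypothetical.

```python
from collections import Counter


def win_rate(verdicts, system="A"):
    """Share of comparisons that the judge awarded to `system`.

    Assumption: ties stay in the denominator, so they lower the rate.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    return counts[system] / len(verdicts)


# Hypothetical judge output for ten comparisons.
verdicts = ["A"] * 7 + ["B"] * 2 + ["tie"]
rate = win_rate(verdicts)
print(f"win rate: {rate:.0%}")
```

Whether ties are dropped or kept changes the reported figure, so the convention is worth stating alongside any judge-based number.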

Who Should Care

What To Try In 7 Days

Run IDEA-FinBench (public code) on your LLM to measure finance knowledge gaps.

Add a retrieval layer (embedding DB + top-5 cosine-similarity search) to provide context for queries.

Apply a small LoRA instruction-tuning pass on a few finance instructions and validate on CPA-style QA.
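The retrieval step above (embed the query, rank stored passages by cosine similarity, keep the top few) can be sketched without any external service. This is a toy illustration only: the bag-of-words `embed` function is a stand-in for a real embedding model, the passages are made up, and a production setup would use a vector database rather than a linear scan.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, passages, k=5):
    """Return the k (score, passage) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical passage store.
passages = [
    "deferred tax assets arise from deductible temporary differences",
    "goodwill is tested for impairment at least annually",
    "inventory is measured at the lower of cost and net realizable value",
]
hits = top_k("how is inventory measured", passages, k=2)
for score, passage in hits:
    print(f"{score:.2f}  {passage}")
```

The retrieved passages would then be placed in the prompt as context, ideally with their source identifiers so the answer can cite them.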

Agent Features

Memory

  • retrieval memory (external FinCorpus/paragraph store)

Planning

  • pipeline with staged agents for rewrite → retrieve → extract → generate
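The staged flow above (rewrite, retrieve, extract, generate) can be sketched as a chain of functions that pass along a shared state. This is a structural sketch only: every stage here is a hypothetical placeholder (simple string handling and keyword matching), whereas the described system uses LLM-based agents and real search backends.

```python
from dataclasses import dataclass, field


@dataclass
class QAState:
    question: str
    rewritten: str = ""
    passages: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    answer: str = ""


def rewrite(state):
    # Placeholder for an LLM query rewriter: normalize the question.
    state.rewritten = state.question.strip().rstrip("?").lower()
    return state


def retrieve(state, store):
    # Placeholder for embedding/text search: keep passages sharing a term.
    terms = set(state.rewritten.split())
    state.passages = [
        (source, text)
        for source, text in store.items()
        if terms & set(text.lower().split())
    ]
    return state


def extract(state, limit=2):
    # Placeholder for an LLM extractor: keep the first few passages.
    state.evidence = state.passages[:limit]
    return state


def generate(state):
    # Placeholder for an LLM generator: answer only from cited evidence.
    if not state.evidence:
        state.answer = "No supporting source found."
    else:
        cites = ", ".join(source for source, _ in state.evidence)
        state.answer = f"{state.evidence[0][1]} [sources: {cites}]"
    return state


def run_pipeline(question, store):
    state = rewrite(QAState(question))
    state = retrieve(state, store)
    state = extract(state)
    return generate(state)


# Hypothetical document store keyed by source id.
store = {
    "doc-1": "Goodwill is tested for impairment annually.",
    "doc-2": "Bonds pay periodic coupons.",
}
result = run_pipeline("Goodwill impairment?", store)
print(result.answer)
```

Keeping each stage as a separate function makes it easy to swap a placeholder for an LLM call, and the explicit "no source" branch avoids answering when retrieval comes back empty.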

Tool Use

  • embedding DB (ChromaDB)
  • text-index search
  • web crawlers for news and reports

Frameworks

  • LLaMA-Factory
  • LangChain
  • LlamaIndex

Is Agentic

true

Architectures

  • LLM-based agents (query rewriter, intention detector, extractor, generator)

Collaboration

  • multi-agent pipeline (sequential specialized agents)

Optimization Features

Infra Optimization

  • LoRA

Model Optimization

  • LoRA

System Optimization

  • flash-attention used in fine-tuning; bfloat16 training

Training Optimization

  • LoRA

Inference Optimization

  • not explicitly described beyond standard LLM inference
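LoRA keeps fine-tuning cheap because it trains two small low-rank matrices per adapted weight instead of the full weight matrix. The arithmetic below illustrates the saving for a single square projection. The hidden size and rank are illustrative assumptions, not values reported in the source, and the helper names are hypothetical.

```python
def full_params(d_out, d_in):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden = 4096  # assumed hidden size, for illustration only
rank = 8       # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4%}")
```

Under these assumed sizes the adapter trains well under 1% of the parameters of the matrix it modifies, which is why a small instruction-tuning pass is feasible on modest hardware.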

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • FinCorpus is Chinese-heavy; gains are stronger on Chinese CPA tasks than on English CFA.
  • Automatic judging (GPT-4) may introduce bias in factuality evaluation.
  • Some publicly released financial LLMs degraded after domain training, showing fine-tuning can harm performance.
  • FinCorpus sourcing and cleaning needed heavy heuristics; coverage and labeling quality may vary.

When Not To Use

  • Do not rely on this pipeline alone for high-stakes investment or trading decisions without human oversight.
  • Avoid using retrieval-only context when retrieval sources cannot be verified or are behind paywalls.
  • Not appropriate where millisecond-level latency is required (real-time trading).

Failure Modes

  • Bad or missing retrieval leads to confident but wrong answers (hallucination despite RAG).
  • Domain fine-tuning can damage a base model's general abilities if instruction data is misaligned.
  • Multi-answer (combination) questions are still fragile for some models and retrieval setups.

Core Entities

Models

  • GPT-4
  • ChatGPT
  • LLaMA-2-7B
  • LLaMA-2-13B
  • Chinese-Alpaca-2-7B
  • Chinese-Alpaca-2-13B
  • ChatGLM3-6B
  • Baichuan2-7B
  • Baichuan2-13B
  • Qwen-7B
  • Qwen-14B
  • Yi-6B
  • Yi-34B
  • DISC-FinLLM
  • Tongyi-Finance-14B
  • IDEA-FinLLM

Metrics

  • Accuracy
  • winning rate (judge-based)

Datasets

  • IDEA-FinBench
  • FinCorpus
  • FinFact
  • CPA exam questions
  • CFA Level 1 & 2 questions

Benchmarks

  • IDEA-FinBench
  • FinFact

Context Entities

Models

  • GPT-3.5 / gpt-4-turbo-preview (used as judge/for generation)

Metrics

  • precision/recall not explicitly reported

Datasets

  • FinEval
  • C-Eval (reference)