Overview
The paper provides working datasets, code pointers, and concrete experiments showing reproducible gains. Some results are scored by GPT-4 acting as an automated judge, so they carry judge-bias caveats.
Citations: 5
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with lightweight instruction fine-tuning. This yields more factual answers with citable sources, without re-pretraining the full model.
Summary TLDR
This work builds three practical artifacts for financial LLMs: (1) IDEA-FinBench, a bilingual benchmark using CPA and CFA exam questions (4.6k items) to measure finance knowledge; (2) IDEA-FinKER, a two-paradigm knowledge-injection framework (soft: retrieval-based few-shot; hard: instruction fine-tuning), together with a 300k-item Chinese FinCorpus; (3) IDEA-FinQA, an LLM-driven retrieval QA system (agents plus embedding and text search) with a new FinFact fact-checking dataset. Experiments show that GPT-4 leads on the benchmark, that domain injection helps weaker models (up to about 9 percentage points on CPA), and that the retrieval QA system achieves a 70% factual win rate on FinFact as scored by a GPT-4 judge.
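To make the win-rate figure concrete, here is a minimal sketch of how such a number could be computed from per-question judge verdicts. The verdict labels (`win`, `loss`, `tie`), the `win_rate` helper, and the choice to count ties as non-wins are assumptions for illustration; the paper's exact judging protocol is not described in this summary, and the sample verdicts are made up.

```python
from collections import Counter


def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons the judge marked as a win (ties count as non-wins here)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return counts["win"] / len(verdicts)


# Hypothetical judge output for ten questions, not data from the paper.
verdicts = ["win"] * 7 + ["loss"] * 2 + ["tie"]
rate = win_rate(verdicts)
print(f"win rate: {rate:.0%}")
```

How ties are handled changes the headline number, so it is worth checking the convention before comparing win rates across systems.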
Problem Statement
Finance needs LLMs that are accurate, up-to-date, and expert-level. Off-the-shelf LLMs hallucinate, training updates are costly, and existing financial benchmarks are limited. The paper asks how to measure finance knowledge, how to inject factual finance knowledge quickly and cheaply, and how to build a retrieval-backed QA system that cites its sources.
Main Contribution
IDEA-FinBench: bilingual (Chinese/English) benchmark from CPA and CFA exams (4,617 questions) covering 16 subjects and four question formats, with a modular evaluation suite.
IDEA-FinKER: a two-pronged knowledge-injection framework — soft-injecting (retrieval-based few-shot using FinCorpus) and hard-injecting (LoRA instruction fine-tuning) — plus a cleaned FinCorpus (~300k unique Chinese exam questions).
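The soft-injecting paradigm can be pictured as prompt assembly: solved questions retrieved from a corpus are placed in front of the new question as few-shot demonstrations. The sketch below illustrates that idea only. The `ExamItem` schema, the `build_few_shot_prompt` helper, the prompt wording, and the sample question are all hypothetical and are not taken from IDEA-FinKER or FinCorpus.

```python
from dataclasses import dataclass


@dataclass
class ExamItem:
    """One solved exam question (hypothetical schema, not the FinCorpus format)."""
    question: str
    options: list[str]
    answer: str


def build_few_shot_prompt(retrieved: list[ExamItem], query: str, query_options: list[str]) -> str:
    """Put retrieved solved questions first as demonstrations, then ask the new question."""
    parts = []
    for item in retrieved:
        opts = "\n".join(item.options)
        parts.append(f"Question: {item.question}\n{opts}\nAnswer: {item.answer}")
    opts = "\n".join(query_options)
    # The final block ends with an open "Answer:" for the model to complete.
    parts.append(f"Question: {query}\n{opts}\nAnswer:")
    return "\n\n".join(parts)


# Made-up demonstration item standing in for a retrieved corpus entry.
demo = ExamItem(
    question="Which statement reports revenues and expenses for a period?",
    options=["A. Balance sheet", "B. Income statement", "C. Cash flow statement"],
    answer="B",
)
prompt = build_few_shot_prompt(
    [demo],
    "Which statement reports assets and liabilities at a point in time?",
    ["A. Balance sheet", "B. Income statement", "C. Cash flow statement"],
)
print(prompt)
```

Because the base model's weights are untouched, this route needs no training run; the hard-injecting route trades that convenience for knowledge stored in the adapter weights.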
Key Findings
GPT-4 leads on IDEA-FinBench across subjects.
Domain knowledge injection (IDEA-FinKER) raises accuracy for weaker models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | CPA-SA 62.38%; CPA-MA 45.27%; CFA-L1 84.26%; CFA-L2 60.84% | Random (CPA-SA 25%, CPA-MA 10%, CFA 33.33%) | — | IDEA-FinBench test set | Table 3.3 | Table 3.3 |
| Accuracy | CPA-SA 78.71%; CPA-MA 62.35%; CFA-L1 75.49%; CFA-L2 53.87% | Various general models in Table 3.3 | — | IDEA-FinBench test set | Table 3.3 | Table 3.3 |
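The Delta column is empty for both rows, but the first row names a random-guess baseline, so a margin over chance can be derived by simple subtraction. The sketch below does that arithmetic for both rows. It assumes the 33.33% CFA baseline applies to both CFA-L1 and CFA-L2 and that the same random baseline is a meaningful floor for the second row; the labels `row_1` and `row_2` are placeholders, since the table does not name the models.

```python
# Random-guess accuracy per split, in percent (from the Baseline cell of the first row).
# Assumption: the single CFA figure covers both CFA levels.
random_baseline = {"CPA-SA": 25.0, "CPA-MA": 10.0, "CFA-L1": 33.33, "CFA-L2": 33.33}

# Accuracies copied from the two table rows, in percent.
rows = {
    "row_1": {"CPA-SA": 62.38, "CPA-MA": 45.27, "CFA-L1": 84.26, "CFA-L2": 60.84},
    "row_2": {"CPA-SA": 78.71, "CPA-MA": 62.35, "CFA-L1": 75.49, "CFA-L2": 53.87},
}


def margin_over_random(accuracy: dict[str, float]) -> dict[str, float]:
    """Percentage points above the random-guess baseline for each split."""
    return {split: round(acc - random_baseline[split], 2) for split, acc in accuracy.items()}


margins = {name: margin_over_random(acc) for name, acc in rows.items()}
for name, per_split in margins.items():
    for split, pts in per_split.items():
        print(f"{name} {split}: +{pts:.2f} points over random")
```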
What To Try In 7 Days
Run IDEA-FinBench (public code) on your LLM to measure finance knowledge gaps.
Add a retrieval layer (an embedding database queried by top-5 cosine similarity) to supply context for queries.
Apply a small LoRA instruction-tuning pass on a small set of finance instructions and validate on CPA-style QA.
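The retrieval step in the list above can be prototyped without any vector database. The following is a minimal sketch of top-k cosine-similarity search in plain Python; the `cosine` and `top_k` helpers, the document ids, and the three-dimensional toy vectors are illustrative assumptions. A real setup would use embeddings from a model and a proper index.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs with the highest cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.0, 0.0], docs, k=2)
print(hits)
```

The text of the returned documents would then be placed in the prompt as context. Inspecting the scores alongside the ids is useful, because a low best score is an early sign that retrieval found nothing relevant.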
Agent Features
Is Agentic
Yes
Risks & Boundaries
Limitations
FinCorpus is Chinese-heavy; gains are stronger on Chinese CPA tasks than on English CFA.
Automatic judging (GPT-4) may introduce bias in factuality evaluation.
When Not To Use
Do not rely on this pipeline alone for high-stakes investment or trading decisions without human oversight.
Avoid using retrieval-only context when retrieval sources cannot be verified or are behind paywalls.
Failure Modes
Bad or missing retrieval leads to confident but wrong answers (hallucination despite RAG).
Domain fine-tuning can damage a base model's general abilities if instruction data is misaligned.

