Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with compact instruction fine-tuning, giving better factual answers and sourceable outputs without full model re-pretraining.
Summary TLDR
This work builds three practical artifacts for financial LLMs: (1) IDEA-FinBench, a bilingual benchmark using CPA and CFA exam questions (4.6k items) to measure finance knowledge; (2) IDEA-FinKER, a two-paradigm knowledge injection framework (soft: retrieval-based few-shot; hard: instruction fine-tuning), together with a 300k-item Chinese FinCorpus; (3) IDEA-FinQA, an LLM-driven retrieval QA system (agents + embedding/text search) with a new FinFact fact-checking dataset. Experiments show that GPT-4 leads on the benchmark, that domain injection helps weaker models (up to about 9 percentage points on CPA), and that the retrieval QA system achieves a 70% factual win rate on FinFact as judged by GPT-4.
Problem Statement
Finance needs LLMs that are accurate, up-to-date, and expert-level. Off-the-shelf LLMs hallucinate, training updates are costly, and existing financial benchmarks are limited. The paper asks: how to measure finance knowledge, inject factual finance knowledge fast and cheaply, and build a retrieval-backed QA system that cites sources.
Main Contribution
IDEA-FinBench: bilingual (Chinese/English) benchmark from CPA and CFA exams (4,617 questions) covering 16 subjects and four question formats, with a modular evaluation suite.
IDEA-FinKER: a two-pronged knowledge-injection framework — soft-injecting (retrieval-based few-shot using FinCorpus) and hard-injecting (LoRA instruction fine-tuning) — plus a cleaned FinCorpus (~300k unique Chinese exam questions).
IDEA-FinQA and FinFact: a retrieval-augmented, agent-driven QA pipeline plus FinFact, a 1.5k-item Chinese financial fact-check dataset; system returns answers with citations and shows strong factuality under automatic judging.
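The soft-injection idea (retrieval-based few-shot) can be illustrated with a minimal sketch. It assumes the retriever has already returned similar exam questions with their answers; the function name, the `Example` type, and the prompt layout are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical shape: (question, answer) pairs retrieved from a corpus such as FinCorpus.
Example = Tuple[str, str]


def build_few_shot_prompt(query: str, retrieved: List[Example], max_shots: int = 3) -> str:
    """Prepend retrieved question/answer pairs to the query as in-context examples."""
    parts = []
    for question, answer in retrieved[:max_shots]:
        parts.append(f"Question: {question}\nAnswer: {answer}")
    # The final block leaves the answer open for the model to complete.
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)
```

No model weights change in this paradigm; only the prompt does, which is why it is cheap to update when the corpus changes.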
Key Findings
GPT-4 leads on IDEA-FinBench across subjects.
Domain knowledge injection (IDEA-FinKER) raises accuracy for weaker models.
Combined soft + hard injection works best.
IDEA-FinQA achieves strong factual answers judged by GPT-4.
Vertical/fine-tuned financial LLMs don't always outperform general models.
Results
Accuracy
IDEA-FinKER improvement (CPA-SA)
IDEA-FinQA factual winning rate
Who Should Care
What To Try In 7 Days
Run IDEA-FinBench (public code) on your LLM to measure finance knowledge gaps.
Add a retrieval layer (an embedding database with top-5 cosine-similarity search) to provide context for queries.
Apply a small LoRA instruction-tuning pass on a small set of finance instructions and validate on CPA-style QA.
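As a rough illustration of the retrieval step in the list above, the sketch below ranks passages by cosine similarity and keeps the top k. The bag-of-words `embed` function is a toy stand-in chosen so the example runs with the standard library only; a real setup would presumably use an embedding model and a vector store.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words vector (an assumption; replace with a real embedding model)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, passages: List[str], k: int = 5) -> List[Tuple[float, str]]:
    """Return the k passages most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The returned passages would then be placed in the prompt as context, with their source identifiers kept so the answer can cite them.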
Agent Features
Memory
- retrieval memory (external FinCorpus/paragraph store)
Planning
- pipeline with staged agents for rewrite → retrieve → extract → generate
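The staged flow (rewrite, retrieve, extract, generate) can be sketched as plain function composition. Each stage here is a hypothetical callable standing in for an LLM-backed agent; the signatures and the `demo` stubs are assumptions for illustration, not the system's actual interfaces.

```python
from typing import Callable, List


def run_pipeline(
    question: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    extract: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Run the four stages in order, passing each stage's output to the next."""
    query = rewrite(question)            # normalize or expand the user question
    passages = retrieve(query)           # fetch candidate passages
    evidence = extract(query, passages)  # keep only the relevant spans
    return generate(question, evidence)  # answer the original question with evidence


def demo() -> str:
    # Tiny in-memory store keyed by topic (purely illustrative).
    store = {
        "deferred tax": "Passage A on deferred tax assets.",
        "duration": "Passage B on bond duration.",
    }
    return run_pipeline(
        "What is a deferred tax asset?",
        rewrite=lambda q: q.lower().rstrip("?"),
        retrieve=lambda query: [text for key, text in store.items() if key in query],
        extract=lambda query, passages: passages[:1],
        generate=lambda q, evidence: f"Answer to '{q}' citing: {evidence}",
    )
```

Keeping the stages as separate callables makes it easy to swap one agent (for example the retriever) without touching the others.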
Tool Use
- embedding DB (ChromaDB)
- text-index search
- web crawlers for news and reports
Frameworks
- LLaMA-Factory
- LangChain
- LlamaIndex
Is Agentic
true
Architectures
- LLM-based agents (query rewriter, intention detector, extractor, generator)
Collaboration
- multi-agent pipeline (sequential specialized agents)
Optimization Features
Infra Optimization
- LoRA
Model Optimization
- LoRA
System Optimization
- flash-attention used in fine-tuning; bfloat16 training
Training Optimization
- LoRA
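To see why LoRA keeps fine-tuning cheap, it helps to count parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains a low-rank update B @ A, with B of shape d_out x r and A of shape r x d_in. The sketch below compares the two counts; the 4096 x 4096 layer and rank 8 are illustrative assumptions, not settings reported in this summary.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank update B (d_out x rank) @ A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # assumed sizes for illustration only
    trainable = lora_trainable_params(d_out, d_in, rank)
    total = full_params(d_out, d_in)
    print(f"trainable: {trainable}, frozen: {total}, ratio: {trainable / total:.4%}")
```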
Inference Optimization
- not explicitly described beyond standard LLM inference
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- FinCorpus is Chinese-heavy; gains are stronger on Chinese CPA tasks than on English CFA.
- Automatic judging (GPT-4) may introduce bias in factuality evaluation.
- Some publicly released financial LLMs degraded after domain training, showing fine-tuning can harm performance.
- FinCorpus sourcing and cleaning needed heavy heuristics; coverage and labeling quality may vary.
When Not To Use
- Do not rely on this pipeline alone for high-stakes investment or trading decisions without human oversight.
- Avoid using retrieval-only context when retrieval sources cannot be verified or are behind paywalls.
- Not appropriate where millisecond-level latency is required (real-time trading).
Failure Modes
- Bad or missing retrieval leads to confident but wrong answers (hallucination despite RAG).
- Domain fine-tuning can damage a base model's general abilities if instruction data is misaligned.
- Multi-answer (combination) questions are still fragile for some models and retrieval setups.
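One reason multi-answer questions are fragile is that strict scoring gives no credit for a partially correct set of options. The sketch below scores predictions by exact set match; whether the benchmark itself uses strict matching or partial credit is not stated here, so treat the rule as an assumption.

```python
from typing import FrozenSet, List


def normalize(answer: str) -> FrozenSet[str]:
    """Reduce an answer string such as 'a, C' to the set of option letters {'A', 'C'}."""
    return frozenset(ch for ch in answer.upper() if ch.isalpha())


def exact_match_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of questions where the predicted option set equals the gold set."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return correct / len(golds)
```

Under this rule a prediction of "A,D" against a gold answer of "ABD" counts as wrong, even though two of three options were found.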
Core Entities
Models
- GPT-4
- ChatGPT
- LLaMA-2-7B
- LLaMA-2-13B
- Chinese-Alpaca-2-7B
- Chinese-Alpaca-2-13B
- ChatGLM3-6B
- Baichuan2-7B
- Baichuan2-13B
- Qwen-7B
- Qwen-14B
- Yi-6B
- Yi-34B
- DISC-FinLLM
- Tongyi-Finance-14B
- IDEA-FinLLM
Metrics
- Accuracy
- winning rate (judge-based)
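A judge-based winning rate is simply the share of comparisons the judge awards to the system under test. The sketch below assumes the judge's output has already been reduced to one label per item; the label names and the choice to count ties as non-wins are assumptions, not details from the paper.

```python
from typing import List


def winning_rate(verdicts: List[str], system: str = "system") -> float:
    """Share of verdicts equal to `system`; ties and losses both count as non-wins."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    wins = sum(v == system for v in verdicts)
    return wins / len(verdicts)
```

Because the judge is itself an LLM, it is worth spot-checking a sample of verdicts by hand before trusting the aggregate number.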
Datasets
- IDEA-FinBench
- FinCorpus
- FinFact
- CPA exam questions
- CFA Level 1 & 2 questions
Benchmarks
- IDEA-FinBench
- FinFact
Context Entities
Models
- GPT-3.5 / gpt-4-turbo-preview (used as judge/for generation)
Metrics
- precision/recall not explicitly reported
Datasets
- FinEval
- C-Eval (reference)

