Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
14
Why It Matters For Business
FinLLMs can automate common financial language tasks, but their performance is uneven across tasks. Use task-fine-tuned PLMs for classification and NER to cut cost; reserve large LLMs for complex QA or exploratory uses, with human checks.
Summary TLDR
This paper surveys financial large language models (FinLLMs) and their precursors (FinPLMs). It catalogs model families (FinBERT variants, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT), summarizes how they were trained (continual/domain/mixed pretraining, instruction fine-tuning, prompt engineering), and reviews evaluation on six benchmark tasks (sentiment, classification, NER, QA, stock movement, summarization). Mixed-domain PLMs plus task fine-tuning still win on simpler classification and NER tasks; large LLMs (GPT-4) do better on numeric QA and general tasks but fall short of task-specific SOTA on hard finance tasks. The paper lists eight advanced financial tasks and flags core gaps: low
Problem Statement
FinLLM research is nascent. Practitioners need a compact map of which models, training choices, datasets, and benchmarks work for which finance tasks, and where gaps (data quality, numerical reasoning, privacy, hallucination) remain.
Main Contribution
First broad survey connecting general LMs to financial PLMs and FinLLMs.
Comparison of training techniques: continual, domain-specific, mixed, instruction fine-tuning, and prompt engineering.
Summary of performance across six benchmark tasks and a curated list of eight advanced financial tasks and datasets.
Discussion of practical challenges: hallucination, privacy, evaluation gaps, and deployment trade-offs; GitHub collection of datasets and resources.
Key Findings
For sentiment analysis, mixed-domain PLMs achieved top scores, while instruction-fine-tuned LLMs matched them at higher cost.
Text classification on financial headlines is handled well by both mixed-domain models and instruction-tuned LLMs.
On named entity recognition, large LLMs can match strong PLMs, but many FinLLMs underperform.
Financial question answering with numerical reasoning remains hard for FinLLMs; GPT-4 improves results but remains below human expert performance.
On stock movement prediction, GPT-4 shows small gains over FinLLMs, but task-specific models still lead.
Results
Sentiment analysis F1
Headline classification Avg F1
NER Entity F1
FinQA Exact Match (EM)
Stock movement prediction Accuracy
Summarization ROUGE-1
Who Should Care
What To Try In 7 Days
Run FPB and FiQA-SA tests comparing a fine-tuned FinBERT vs GPT-4 to measure cost vs accuracy.
Prototype a RAG pipeline over internal documents to ground LLM answers and reduce hallucination.
Fine-tune a small LLaMA-based FinLLM with LoRA/PEFT on one task (sentiment or headline classification).
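The RAG prototype suggested above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a method from the survey: retrieval uses bag-of-words cosine similarity from the standard library where a real pipeline would more likely use dense embeddings and a vector store, and the `DOCS` snippets, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents ranked by similarity, dropping zero-overlap ones."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_grounded_prompt(query, documents, k=2):
    """Build a prompt that asks the model to answer only from retrieved passages."""
    passages = retrieve(query, documents, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered passages. "
        "If the answer is not in them, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical internal snippets, invented for illustration only.
DOCS = [
    "Q3 revenue was 120 million dollars, up 8 percent year over year.",
    "The board approved a share buyback program in October.",
    "Operating margin in Q3 was 21 percent.",
]

prompt = build_grounded_prompt("What was Q3 revenue?", DOCS)
```

The resulting `prompt` string would then be sent to whichever LLM is under test. Restricting the model to numbered passages makes it easier for a human reviewer to trace each figure in the answer back to a source document.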
Optimization Features
Token Efficiency
- Few-shot prompting
Model Optimization
- LoRA
- PEFT
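As a rough illustration of why LoRA reduces fine-tuning cost: it freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in instead. The sketch below only does the parameter arithmetic; the layer size and rank are illustrative assumptions, not values reported for any model in the survey.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_out * d_in


# Illustrative sizes only (assumed, not taken from a surveyed model).
d_out = d_in = 4096
rank = 8

lora = lora_trainable_params(d_out, d_in, rank)  # 65536
full = full_params(d_out, d_in)                  # 16777216
ratio = lora / full                              # 0.00390625, i.e. 1/256
```

Under these assumed sizes, the adapted layer trains about 0.4% of the parameters that full fine-tuning of the same matrix would update.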
System Optimization
- LLMOps with CI/CD
Training Optimization
- Instruction fine-tuning
- Continual pre-training
- Mixed-domain pre-training
Inference Optimization
- Prompt engineering (few-shot)
- Use smaller fine-tuned PLMs for cheap inference
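Few-shot prompting, listed above, can be sketched as a prompt builder plus a strict parser for the model's reply. The demonstration sentences, label set, and function names below are hypothetical and written for illustration; they are not drawn from FPB, FiQA-SA, or any prompt used in the surveyed papers.

```python
LABELS = ("positive", "negative", "neutral")

# Hypothetical demonstrations invented for this sketch.
EXAMPLES = [
    ("Operating profit rose to 12 million from 9 million a year earlier.", "positive"),
    ("The company cut its full-year guidance after weak demand.", "negative"),
    ("The annual general meeting will be held on 14 May.", "neutral"),
]


def build_few_shot_prompt(sentence, examples=EXAMPLES):
    """Prepend labeled demonstrations to the sentence to be classified."""
    lines = [
        "Classify the sentiment of each financial sentence "
        "as positive, negative, or neutral.",
        "",
    ]
    for text, label in examples:
        lines += [f"Sentence: {text}", f"Sentiment: {label}", ""]
    lines += [f"Sentence: {sentence}", "Sentiment:"]
    return "\n".join(lines)


def parse_label(completion):
    """Map a raw completion to a known label, or None if it does not match."""
    words = completion.strip().lower().split()
    first = words[0].strip(".,:;!") if words else ""
    return first if first in LABELS else None


prompt = build_few_shot_prompt("Net sales fell 5 percent.")
```

Returning `None` for unparseable completions, rather than guessing a label, keeps malformed outputs visible when measuring accuracy. Each added demonstration also lengthens every request, so the number of shots is a direct trade-off against token cost.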
Reproducibility
Code URLs
Data URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Many evaluations use standard NLP metrics that miss finance-specific costs and risks.
- Key datasets are limited in size, modality, or public availability (e.g., proprietary data such as Bloomberg's).
- Comparisons mix prompting, instruction tuning, and fine-tuning approaches, which makes head-to-head conclusions noisy.
- Some model/data releases are closed-source, blocking reproducibility and fair comparison.
When Not To Use
- High-stakes numeric decision-making without human verification (e.g., regulatory filings).
- Real-time trading systems that lack rigorous backtesting and risk controls.
- Private or sensitive data without a vetted RAG/privileged-access architecture.
Failure Modes
- Hallucination of facts or numbers not grounded in source documents.
- Poor numerical reasoning on financial tables and reports.
- Performance variance across datasets and prompt styles.
- Privacy leakage when using raw internal data with general LLMs.
Core Entities
Models
- FinBERT-19
- FinBERT-20
- FinBERT-21
- FLANG
- BloombergGPT
- FinMA
- InvestLM
- FinGPT
- GPT-4
- ChatGPT
- LLaMA
- BLOOM
Metrics
- F1
- Accuracy
- EM (Exact Match)
- ROUGE-1
- BERTScore
- Sharpe ratio
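For readers less familiar with the metrics listed above, here is a minimal standard-library sketch of four of them. Several choices are assumptions, since the summary does not specify them: F1 is macro-averaged, ROUGE-1 is the unigram F-score on whitespace tokens without stemming, Exact Match normalizes only case and surrounding whitespace, and the Sharpe ratio is not annualized. Published results typically come from established evaluation packages, whose numbers may differ slightly.

```python
import statistics
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def f1_for_label(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def rouge1_f(candidate, reference):
    """Unigram-overlap F-score with clipped counts."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return over its sample standard deviation (needs >= 2 returns)."""
    excess = [x - risk_free for x in returns]
    sd = statistics.stdev(excess)
    return statistics.mean(excess) / sd if sd else 0.0
```

Note that Exact Match on numeric answers is sensitive to formatting ("12.5" versus "12.50"), which is one reason QA scores can vary with answer normalization.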
Datasets
- Financial PhraseBank (FPB)
- FiQA-SA
- FinQA
- ConvFinQA
- StockNet
- CIKM18
- BigData22
- ECTSum
- FinSBD
- FinRED
- FiNER-139
- FinTabNet
- SemEval-2017
- StockEmotions
- Headline FIN
Benchmarks
- FLUE
- FinQA
- ConvFinQA
- FPB
- ECTSum
Context Entities
Models
- BERT
- ELECTRA
- GPT-3
- GPT-3.5
Metrics
- ROUGE-2
- ROUGE-L
Datasets
- FinPile
- SEC EDGAR
- Seeking Alpha
- Reuters
- Investopedia
Benchmarks
- FiQA-QA
- FedNLP
- FOMC

