Overview
The survey compiles multiple published results, but the models and datasets it covers are heterogeneous, so its conclusions hold only for the specific tasks and data evaluated.
Citations: 14
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 40%
Why It Matters For Business
FinLLMs help automate common finance language tasks, but their performance is uneven across tasks. Use task-fine-tuned PLMs for classification and NER to cut cost; reserve large LLMs for complex QA or exploratory use, with human checks.
Summary TLDR
This paper surveys financial large language models (FinLLMs) and their precursors (FinPLMs). It catalogs model families (FinBERT variants, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT), summarizes how they were trained (continual/domain/mixed pretraining, instruction fine-tuning, prompt engineering), and reviews evaluation on six benchmark tasks (sentiment, classification, NER, QA, stock movement, summarization). Mixed-domain PLMs plus task fine-tuning still win on simpler classification and NER tasks; large LLMs (GPT-4) do better on numeric QA and general tasks but fall short of task-specific SOTA on hard finance tasks. The paper lists eight advanced financial tasks and flags core gaps: low
Problem Statement
FinLLM research is nascent. Practitioners need a compact map of which models, training choices, datasets, and benchmarks work for which finance tasks, and where gaps (data quality, numerical reasoning, privacy, hallucination) remain.
Main Contribution
First broad survey connecting general LMs to financial PLMs and FinLLMs.
Comparison of training techniques: continual, domain-specific, mixed, instruction fine-tuning, and prompt engineering.
Key Findings
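For sentiment analysis, mixed-domain PLMs achieved the top score (92% F1 for FLANG-ELECTRA on Financial PhraseBank), while large LLMs (FinMA-30B and GPT-4 with 5-shot prompting) reached about 87% and cost more to run.
Text classification on financial headlines is handled well by both mixed-domain models and instruction-tuned LLMs (≈98% average F1 for FLANG and FinMA-30B, about 1 point above BERT/FinBERT).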
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Sentiment analysis F1 | 92% (FLANG-ELECTRA) | FinMA-30B / GPT-4 ≈87% (5-shot) | ≈+5pp over FinMA/GPT-4 | Financial PhraseBank (FPB) | Section 4.1 | 4.1 |
| Headline classification Avg F1 | ≈98% (FLANG, FinMA-30B) | BERT/FinBERT ≈97% | ≈+1pp | Headline FIN | Section 4.2; Table 1 | 4.2 |
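The Delta column above is an absolute difference in percentage points. As a rough illustration, the sketch below recomputes it from the table's approximate scores and also derives a relative error reduction, which is not reported in the source and is added here only to show how small absolute gains look near the ceiling. The function names are hypothetical, and because the inputs are rounded, the outputs are approximate.

```python
# Scores copied from the Results table (approximate, in percent).
results = {
    "FPB sentiment F1": {"best": 92.0, "baseline": 87.0},
    "Headline avg F1": {"best": 98.0, "baseline": 97.0},
}


def delta_pp(best: float, baseline: float) -> float:
    """Absolute difference in percentage points (the table's Delta column)."""
    return best - baseline


def relative_error_reduction(best: float, baseline: float) -> float:
    """Share of the baseline's remaining error (100 - baseline) that the best model removes.

    This is an added illustration, not a figure from the survey.
    """
    return (best - baseline) / (100.0 - baseline)


for name, scores in results.items():
    pp = delta_pp(**scores)
    rer = relative_error_reduction(**scores)
    print(f"{name}: {pp:+.1f}pp, relative error reduction {rer:.1%}")
```

A 1-point gain on the headline task removes about a third of the remaining error, which is why small deltas on near-saturated benchmarks should be read alongside the baseline level.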
What To Try In 7 Days
Run FPB and FiQA-SA tests comparing a fine-tuned FinBERT vs GPT-4 to measure cost vs accuracy.
Prototype a RAG pipeline over internal documents to ground LLM answers and reduce hallucination.
Fine-tune a small LLaMA-based FinLLM with LoRA/PEFT on one task (sentiment or headline classification).
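For the first item, a minimal sketch of a cost-versus-accuracy comparison might look like the following. It assumes you have already collected each model's predictions on the same labeled test set. The predictions, model names, and per-example costs below are toy placeholders, not measured results, and macro-averaged F1 is an assumption, since the averaging used by the benchmark is not stated here.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def compare(gold: list[str], runs: dict[str, tuple[list[str], float]]) -> dict[str, dict[str, float]]:
    """runs maps a model name to (predictions, assumed cost per example in USD)."""
    report = {}
    for name, (pred, cost_per_example) in runs.items():
        report[name] = {
            "macro_f1": macro_f1(gold, pred),
            "total_cost_usd": cost_per_example * len(gold),
        }
    return report


# Toy data only: replace with real test labels and real model outputs.
gold = ["positive", "negative", "neutral", "positive"]
runs = {
    "finetuned_plm (hypothetical)": (["positive", "negative", "neutral", "positive"], 0.0001),
    "large_llm_5shot (hypothetical)": (["positive", "negative", "neutral", "neutral"], 0.01),
}

for name, row in compare(gold, runs).items():
    print(f"{name}: macro F1 = {row['macro_f1']:.3f}, cost = ${row['total_cost_usd']:.4f}")
```

Keeping the scoring function separate from the model calls means the same report can be reused when a LoRA-tuned model or a RAG variant is added as another entry in `runs`.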
Risks & Boundaries
Limitations
Many evaluations use standard NLP metrics that miss finance-specific costs and risks.
Key datasets are limited in size, modality, or public availability (proprietary data like Bloomberg).
When Not To Use
High-stakes numeric decision-making without human verification (e.g., regulatory filings).
Real-time trading systems that lack rigorous backtesting and risk controls.
Failure Modes
Hallucination of facts or numbers not grounded in source documents.
Poor numerical reasoning on financial tables and reports.
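One inexpensive guard against the first failure mode is to flag any number in a model's answer that does not appear in the source text it was given. The sketch below is a simple heuristic under stated assumptions, not a method from the survey: it compares numeric strings after removing thousands separators, so it will miss paraphrased quantities and will flag harmless reformatting such as "4.20" versus "4.2". The function names are hypothetical.

```python
import re

# Matches integers and decimals, with optional thousands separators (e.g. 1,250.5).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> set[str]:
    """Return the set of numeric strings in text, with thousands separators removed."""
    return {match.group().replace(",", "") for match in NUMBER.finditer(text)}


def ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Numbers that appear in the answer but nowhere in the source document."""
    return sorted(extract_numbers(answer) - extract_numbers(source))


source = "Revenue rose to 4.2 billion in 2022, up 12% year over year."
answer = "Revenue was 4.2 billion, up 15% in 2022."

flagged = ungrounded_numbers(answer, source)
if flagged:
    print("Route to human review; ungrounded numbers:", flagged)
```

A non-empty result does not prove a hallucination, since the answer may contain a correctly derived figure, but it is a cheap signal for routing outputs to the human verification recommended above.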

