Survey of financial LLMs: techniques, benchmarks, and practical gaps

February 4, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

14

Authors

Jean Lee, Nicholas Stevens, Soyeon Caren Han, Minseok Song

Why It Matters For Business

FinLLMs can automate common finance language tasks, but their performance is uneven. Use task-fine-tuned PLMs for classification and NER to cut cost; reserve large LLMs for complex QA or exploratory work, with human checks.

Summary TLDR

This paper surveys financial large language models (FinLLMs) and their precursors (FinPLMs). It catalogs model families (FinBERT variants, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT), summarizes how they were trained (continual/domain/mixed pretraining, instruction fine-tuning, prompt engineering), and reviews evaluation on six benchmark tasks (sentiment, classification, NER, QA, stock movement, summarization). Mixed-domain PLMs plus task fine-tuning still win on simpler classification and NER tasks; large LLMs (GPT-4) do better on numeric QA and general tasks but fall short of task-specific SOTA on hard finance tasks. The paper lists eight advanced financial tasks and flags core gaps: low

Problem Statement

FinLLM research is nascent. Practitioners need a compact map of which models, training choices, datasets, and benchmarks work for which finance tasks, and where gaps (data quality, numerical reasoning, privacy, hallucination) remain.

Main Contribution

First broad survey connecting general LMs to financial PLMs and FinLLMs.

Comparison of training techniques: continual, domain-specific, mixed, instruction fine-tuning, and prompt engineering.

Summary of performance across six benchmark tasks and a curated list of eight advanced financial tasks and datasets.

Discussion of practical challenges: hallucination, privacy, evaluation gaps, and deployment trade-offs; GitHub collection of datasets and resources.

Key Findings

For sentiment analysis, mixed-domain PLMs achieved the top scores; instruction-finetuned LLMs came within about five F1 points but cost more.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

Text classification on financial headlines is solved effectively by mixed-domain models and instruction-tuned LLMs.

Numbers: FLANG & FinMA-30B Avg F1≈98%; BERT/FinBERT≈97%

Named entity extraction shows large LLMs can match strong PLMs, but many FinLLMs underperform.

Numbers: GPT-4 entity F1=83%; FLANG≈82%; other FinLLMs 61%–69%

Financial question answering with numerical reasoning remains hard for FinLLMs; GPT-4 helps but is below experts.

Numbers: GPT-4 EM=69%–76%; human experts avg EM≈90%; BloombergGPT EM=43%

Stock movement prediction shows small gains for GPT-4 over FinLLMs but task-specific models still lead.

Numbers: GPT-4 accuracy≈54%; FinMA≈52%; task-SOTA≈58%

Results

Sentiment analysis F1

Value: 92% (FLANG-ELECTRA)

Baseline: FinMA-30B / GPT-4 ≈87% (5-shot)

Headline classification Avg F1

Value: ≈98% (FLANG, FinMA-30B)

Baseline: BERT/FinBERT ≈97%

NER Entity F1

Value: 83% (GPT-4, 5-shot)

Baseline: FLANG ≈82%

FinQA Exact Match (EM)

Value: 69%–76% (GPT-4 zero-shot)

Baseline: BloombergGPT 43%

Stock movement prediction accuracy

Value: 54% (GPT-4 zero-shot)

Baseline: SOTA ≈58%

Summarization ROUGE-1

Value: 30% (GPT-4 zero-shot)

Baseline: Task SOTA 47%
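
Two of the metrics reported above, Exact Match and ROUGE-1, are simple enough to sketch directly. The snippet below is a minimal illustration, not the scoring code used by the surveyed benchmarks: the normalization rule (lowercasing, dropping `$` and thousands separators) and the choice to report ROUGE-1 as an F1 over unigram overlap are assumptions made for the example.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop '$' and ',' and collapse whitespace (assumed rule)."""
    cleaned = re.sub(r"[,$]", "", text.lower())
    return " ".join(cleaned.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(normalize(candidate).split())
    ref = Counter(normalize(reference).split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def em_rate(predictions: list[str], references: list[str]) -> float:
    """Share of predictions that exactly match their reference."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Exact Match gives no partial credit, which is one reason numeric QA scores look harsh: an answer of `12.6%` against a reference of `12.5%` counts as a full miss.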

What To Try In 7 Days

Run FPB and FiQA-SA tests comparing a fine-tuned FinBERT vs GPT-4 to measure cost vs accuracy.
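
A cost-versus-accuracy comparison of this kind can be sketched as follows. This is a rough outline under stated assumptions: the helper names (`macro_f1`, `summarize`) and the per-example prices are hypothetical, and the predictions would come from whatever inference code is used for each system.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def summarize(name: str, y_true: list[str], y_pred: list[str],
              usd_per_example: float) -> dict:
    """One comparison row: quality next to cost per 1,000 examples."""
    return {
        "system": name,
        "macro_f1": round(macro_f1(y_true, y_pred), 3),
        "usd_per_1k": round(usd_per_example * 1000, 2),
    }


if __name__ == "__main__":
    # Placeholder labels and prices, for illustration only.
    gold = ["positive", "negative", "neutral", "positive"]
    small_model = ["positive", "negative", "positive", "positive"]
    print(summarize("fine-tuned PLM (hypothetical)", gold, small_model, 0.0001))
```

Putting both numbers in one row makes the trade-off explicit: a few F1 points may or may not justify a large gap in per-request cost.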

Prototype a RAG pipeline over internal documents to ground LLM answers and reduce hallucination.
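
The retrieval-and-grounding step of such a prototype can be sketched with the standard library alone. This is an illustrative toy, not a production design: it uses bag-of-words cosine similarity where a real pipeline would likely use an embedding model and a vector index, and the prompt wording and function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that restricts the model to retrieved passages."""
    passages = retrieve(question, documents, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. "
        "If the answer is not in them, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the passages lets the model cite which one supports its answer, which makes human review faster.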

Fine-tune a small LLaMA-based FinLLM with LoRA/PEFT on one task (sentiment or headline classification).
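
To see why LoRA keeps this experiment cheap, it helps to look at the arithmetic. LoRA freezes a weight matrix W and trains two small matrices, B (d_out × r) and A (r × d_in), so the layer computes Wx + (alpha / r) · B(Ax). The sketch below illustrates that update and the resulting parameter count in plain Python; the 4096 hidden size and rank 8 are illustrative assumptions, and an actual run would use a library such as PEFT rather than this toy.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha: float) -> list[float]:
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    rank = len(A)  # A has shape (rank, d_in)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense matrix W."""
    return d_in * d_out


if __name__ == "__main__":
    # Illustrative sizes only: one 4096 x 4096 projection, rank 8.
    trainable = lora_trainable_params(4096, 4096, 8)
    total = full_params(4096, 4096)
    print(f"trainable share: {trainable / total:.4%}")
```

With B initialized to zeros, as in the original LoRA formulation, the adapted layer starts out identical to the frozen one, so training begins from the pretrained model's behavior.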

Optimization Features

Token Efficiency

  • Few-shot prompting

Model Optimization

  • LoRA
  • PEFT

System Optimization

  • LLMOps with CI/CD

Training Optimization

  • Instruction fine-tuning
  • Continual pre-training
  • Mixed-domain pre-training

Inference Optimization

  • Prompt engineering (few-shot)
  • Use smaller finetuned PLMs for cheap inference
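
The few-shot prompting listed above amounts to prepending a handful of labeled examples to the query. A minimal sketch follows; the instruction wording, the label set, and the function name are assumptions for illustration, and results will vary with prompt style.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, str]],
    query: str,
    labels: tuple[str, ...] = ("positive", "negative", "neutral"),
) -> str:
    """Build a k-shot sentiment prompt from (sentence, label) pairs."""
    header = (
        "Classify the sentiment of each financial sentence as "
        + ", ".join(labels)
        + "."
    )
    shots = [f"Sentence: {text}\nSentiment: {label}" for text, label in examples]
    # The final block leaves the label blank for the model to complete.
    final = f"Sentence: {query}\nSentiment:"
    return "\n\n".join([header, *shots, final])


if __name__ == "__main__":
    demo = [
        ("Operating profit rose compared with last year.", "positive"),
        ("The company issued a profit warning.", "negative"),
    ]
    print(build_few_shot_prompt(demo, "Sales were flat in the quarter."))
```

Each added example lengthens every request, so the number of shots is a direct lever on token cost.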

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Many evaluations use standard NLP metrics that miss finance-specific costs and risks.
  • Key datasets are limited in size, modality, or public availability (e.g., proprietary data such as Bloomberg's).
  • Comparisons mix prompting, instruction tuning, and fine-tuning approaches, which makes head-to-head comparison noisy.
  • Some model/data releases are closed-source, blocking reproducibility and fair comparison.

When Not To Use

  • High-stakes numeric decision-making without human verification (e.g., regulatory filings).
  • Real-time trading systems that lack rigorous backtesting and risk controls.
  • Private or sensitive data without a vetted RAG/privileged-access architecture.

Failure Modes

  • Hallucination of facts or numbers not grounded in source documents.
  • Poor numerical reasoning on financial tables and reports.
  • Performance variance across datasets and prompt styles.
  • Privacy leakage when using raw internal data with general LLMs.
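
One cheap guard against the first failure mode, hallucinated numbers, is to check that every figure in a generated answer also appears in the source text. The sketch below is a simple heuristic, assumed for illustration: it cannot catch numbers that are present in the source but attached to the wrong quantity, and it would flag correctly derived values (such as a computed difference) that are not stated verbatim.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[float]:
    """Collect all numeric values in the text, ignoring thousands separators."""
    return {float(match.replace(",", "")) for match in NUMBER.findall(text)}


def ungrounded_numbers(answer: str, source: str) -> list[float]:
    """Numbers that appear in the answer but nowhere in the source."""
    return sorted(extract_numbers(answer) - extract_numbers(source))


def is_numerically_grounded(answer: str, source: str) -> bool:
    """True if every number in the answer can be found in the source."""
    return not ungrounded_numbers(answer, source)
```

Answers that fail the check can be routed to a human reviewer instead of being returned directly.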

Core Entities

Models

  • FinBERT-19
  • FinBERT-20
  • FinBERT-21
  • FLANG
  • BloombergGPT
  • FinMA
  • InvestLM
  • FinGPT
  • GPT-4
  • ChatGPT
  • LLaMA
  • BLOOM

Metrics

  • F1
  • Accuracy
  • EM (Exact Match)
  • ROUGE-1
  • BERTScore
  • Sharpe ratio

Datasets

  • Financial PhraseBank (FPB)
  • FiQA-SA
  • FinQA
  • ConvFinQA
  • StockNet
  • CIKM18
  • BigData22
  • ECTSum
  • FinSBD
  • FinRED
  • FiNER-139
  • FinTabNet
  • SemEval-2017
  • StockEmotions
  • Headline FIN

Benchmarks

  • FLUE
  • FinQA
  • ConvFinQA
  • FPB
  • ECTSum

Context Entities

Models

  • BERT
  • ELECTRA
  • GPT-3
  • GPT-3.5

Metrics

  • ROUGE-2
  • ROUGE-L

Datasets

  • FinPile
  • SEC EDGAR
  • Seeking Alpha
  • Reuters
  • Investopedia

Benchmarks

  • FiQA-QA
  • FedNLP
  • FOMC