Survey of financial LLMs: techniques, benchmarks, and practical gaps

February 4, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The survey compiles published results from heterogeneous models and datasets, so its conclusions hold only for the specific tasks and data evaluated.

Citations: 14

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Jean Lee, Nicholas Stevens, Soyeon Caren Han, Minseok Song

Why It Matters For Business

FinLLMs help automate common finance language tasks, but their performance is uneven across tasks. Use task-finetuned PLMs for classification and NER to cut cost; reserve large LLMs for complex QA or exploratory uses with human checks.

Summary TLDR

This paper surveys financial large language models (FinLLMs) and their precursors (FinPLMs). It catalogs model families (FinBERT variants, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT), summarizes how they were trained (continual/domain/mixed pretraining, instruction fine-tuning, prompt engineering), and reviews evaluation on six benchmark tasks (sentiment, classification, NER, QA, stock movement, summarization). Mixed-domain PLMs plus task fine-tuning still win on simpler classification and NER tasks; large LLMs (GPT-4) do better on numeric QA and general tasks but fall short of task-specific SOTA on hard finance tasks. The paper lists eight advanced financial tasks and flags core gaps: low

Problem Statement

FinLLM research is nascent. Practitioners need a compact map of which models, training choices, datasets, and benchmarks work for which finance tasks, and where gaps (data quality, numerical reasoning, privacy, hallucination) remain.

Main Contribution

First broad survey connecting general LMs to financial PLMs and FinLLMs.

Comparison of training techniques: continual, domain-specific, mixed, instruction fine-tuning, and prompt engineering.

Key Findings

For sentiment analysis, mixed-domain PLMs achieved the top scores; instruction-finetuned LLMs came within about 5 points of them at a higher cost.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

Practical Use: Use a fine-tuned PLM (cheaper) for production sentiment tasks; reserve large LLM calls for edge cases.

Evidence Ref: Section 4.1; Table 1
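One way to act on the practical-use note above is a confidence-based router: the cheap fine-tuned PLM answers first, and only low-confidence inputs go to the large LLM. The sketch below is illustrative and not from the survey. The `Classifier` interface, the `threshold` of 0.8, and the two toy models are all assumptions standing in for real model calls.

```python
from typing import Callable, Tuple

# Hypothetical interface: a classifier returns (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def route_sentiment(text: str, cheap: Classifier, expensive: Classifier,
                    threshold: float = 0.8) -> Tuple[str, str]:
    """Return (label, source); call the expensive model only when the cheap one is unsure."""
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "plm"
    label, _ = expensive(text)
    return label, "llm"


# Stand-in models for illustration only; replace with real model calls.
def toy_plm(text: str) -> Tuple[str, float]:
    lowered = text.lower()
    if "beat" in lowered or "record profit" in lowered:
        return "positive", 0.95
    if "miss" in lowered or "loss" in lowered:
        return "negative", 0.90
    return "neutral", 0.50  # unsure, so the router defers to the larger model


def toy_llm(text: str) -> Tuple[str, float]:
    return "neutral", 0.70
```

Logging the second element of the returned tuple gives the share of traffic that reaches the large model, which is the number that drives cost in this setup.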

Text classification on financial headlines is handled well by both mixed-domain models and instruction-tuned LLMs, with only about a 1-point gap to general BERT-style baselines.

Numbers: FLANG & FinMA-30B Avg F1≈98%; BERT/FinBERT≈97%

Practical Use: For headline classification, prefer task-finetuned PLMs or small FinLLMs to save cost without much accuracy loss.

Evidence Ref: Section 4.2; Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Sentiment analysis F1 | 92% (FLANG-ELECTRA) | FinMA-30B / GPT-4 ≈87% (5-shot) | ≈+5pp over FinMA/GPT-4 | Financial PhraseBank (FPB) | Section 4.1 | 4.1
Headline classification Avg F1 | ≈98% (FLANG, FinMA-30B) | BERT/FinBERT ≈97% | ≈+1pp | Headline FIN | Section 4.2; Table 1 | 4.2
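The deltas above are percentage-point differences between F1 scores. As a rough sketch of how such numbers could be reproduced on your own predictions, the code below computes per-class F1, a macro average, and a percentage-point delta. Macro averaging is an assumption here; this summary does not state which averaging the reported scores use, and all function names are illustrative.

```python
from typing import Dict, List


def f1_per_class(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the class never appears."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(gold, pred)
    return sum(scores.values()) / len(scores)


def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points, e.g. 92% vs 87% gives 5."""
    return value_pct - baseline_pct
```

For example, `delta_pp(92, 87)` reproduces the roughly 5-point gap reported for sentiment analysis, and `delta_pp(98, 97)` the roughly 1-point gap for headline classification.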

What To Try In 7 Days

Run FPB and FiQA-SA tests comparing a fine-tuned FinBERT vs GPT-4 to measure cost vs accuracy.

Prototype a RAG pipeline over internal documents to ground LLM answers and reduce hallucination.

Fine-tune a small LLaMA-based FinLLM with LoRA/PEFT on one task (sentiment or headline classification).
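For the RAG prototype suggested above, a minimal retrieval-and-prompt sketch might look like the following. It swaps in plain token overlap for ranking rather than the embedding search a real pipeline would likely use, and it stops at building the prompt instead of calling any model. The function names, prompt wording, and sample documents are all hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lowercase bag of alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Rank documents by token overlap with the query; return up to k with any overlap."""
    q = tokenize(query)
    scored = [(sum((q & tokenize(doc)).values()), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a prompt that instructs the model to answer only from the passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered passages. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )


# Made-up internal documents for illustration.
docs = [
    "Q2 revenue was 4.2 billion USD, up 8 percent year over year.",
    "The board approved a new share buyback program.",
    "Headcount grew in the engineering division.",
]
prompt = build_prompt("What was Q2 revenue?", retrieve("What was Q2 revenue?", docs))
```

Numbering the passages makes it possible to ask the model to cite which passage supports each claim, which helps a human reviewer check the answer.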

Optimization Features

Token Efficiency: Few-shot prompting
Model Optimization: LoRA, PEFT
System Optimization: LLMOps with CI/CD
Training Optimization: Instruction fine-tuning, continual pre-training, mixed-domain pre-training
Inference Optimization: Prompt engineering (few-shot); use smaller fine-tuned PLMs for cheap inference

Risks & Boundaries

Limitations

Many evaluations use standard NLP metrics that miss finance-specific costs and risks.

Key datasets are limited in size, modality, or public availability (proprietary data like Bloomberg).

When Not To Use

High-stakes numeric decision-making without human verification (e.g., regulatory filings).

Real-time trading systems that lack rigorous backtesting and risk controls.

Failure Modes

Hallucination of facts or numbers not grounded in source documents.

Poor numerical reasoning on financial tables and reports.
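A cheap guard against the first failure mode is to flag any number in a model answer that does not appear in the source document. The sketch below is a surface-level check under stated assumptions: it matches digits only (with optional thousands separators and decimals), so legitimately derived values such as ratios or sums will also be flagged and need human review. The function names and sample texts are illustrative.

```python
import re
from typing import List

# Matches integers and decimals, with optional thousands separators (e.g. 1,250.5).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> List[str]:
    """Return numbers found in the text, with thousands separators removed."""
    return [m.replace(",", "") for m in NUMBER.findall(text)]


def ungrounded_numbers(answer: str, source: str) -> List[str]:
    """Numbers that appear in the answer but nowhere in the source."""
    source_numbers = set(extract_numbers(source))
    return [n for n in extract_numbers(answer) if n not in source_numbers]


# Made-up example texts.
source = "Net income rose to 1,250.5 million in 2023 from 980 million."
answer = "Net income was 1250.5 million in 2023, up from 890 million."
flagged = ungrounded_numbers(answer, source)
```

Here the transposed figure in the answer is flagged while the two correctly quoted numbers pass, so the check can route only suspicious answers to a reviewer.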

Core Entities

Models

FinBERT-19, FinBERT-20, FinBERT-21, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT, GPT-4, ChatGPT, LLaMA, BLOOM

Metrics

F1, Accuracy, EM (Exact Match), ROUGE-1, BERTScore, Sharpe ratio

Datasets

Financial PhraseBank (FPB), FiQA-SA, FinQA, ConvFinQA, StockNet, CIKM18, BigData22, ECTSum, FinSBD, FinRED, FiNER-139, FinTabNet, SemEval-2017, StockEmotions, Headline FIN

Benchmarks

FLUE, FinQA, ConvFinQA, FPB, ECTSum

Context Entities

Models

BERT, ELECTRA, GPT-3, GPT-3.5

Metrics

ROUGE-2, ROUGE-L

Datasets

FinPile, SEC EDGAR, Seeking Alpha, Reuters, Investopedia

Benchmarks

FiQA-QA, FedNLP, FOMC