Survey of financial LLMs: techniques, benchmarks, and practical gaps

Overview

Decision Snapshot: Needs Validation

The survey compiles multiple published results but many models and datasets are heterogeneous, so conclusions are conditional on specific tasks and data.

Citations: 14

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Jean Lee, Nicholas Stevens, Soyeon Caren Han, Minseok Song

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FinLLMs can automate common finance language tasks, but their performance is uneven across tasks. Use task-finetuned PLMs for classification and NER to cut cost; reserve large LLMs for complex QA or exploratory use, with human checks.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

This paper surveys financial large language models (FinLLMs) and their precursors (FinPLMs). It catalogs model families (FinBERT variants, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT), summarizes how they were trained (continual/domain/mixed pretraining, instruction fine-tuning, prompt engineering), and reviews evaluation on six benchmark tasks (sentiment, classification, NER, QA, stock movement, summarization). Mixed-domain PLMs plus task fine-tuning still win on simpler classification and NER tasks; large LLMs (GPT-4) do better on numeric QA and general tasks but fall short of task-specific SOTA on hard finance tasks. The paper lists eight advanced financial tasks and flags core gaps: low

Problem Statement

FinLLM research is nascent. Practitioners need a compact map of which models, training choices, datasets, and benchmarks work for which finance tasks, and where gaps (data quality, numerical reasoning, privacy, hallucination) remain.

Main Contribution

First broad survey connecting general LMs to financial PLMs and FinLLMs.

Comparison of training techniques: continual, domain-specific, mixed, instruction fine-tuning, and prompt engineering.

Key Findings

For sentiment analysis, mixed-domain PLMs achieved the top scores; the instruction-finetuned FinMA-30B and GPT-4 (5-shot) scored about 5 F1 points lower and cost more to run.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

Practical Use: Use a fine-tuned PLM (cheaper) for production sentiment tasks; reserve large LLM calls for edge cases.

Evidence Ref: Section 4.1; Table 1
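
One way to read the practical-use note above is as a confidence-based cascade: the cheap fine-tuned PLM answers whenever it is confident, and only low-confidence inputs are escalated to a large LLM. The sketch below is a minimal illustration, not something the survey prescribes. The `route` function, the 0.8 threshold, and both stub models are hypothetical stand-ins.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interfaces: the small model returns label probabilities,
# the large model returns only a label.
SmallModel = Callable[[str], Dict[str, float]]
LargeModel = Callable[[str], str]


def route(text: str, small: SmallModel, large: LargeModel,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Return (label, source), escalating only when the small model is unsure."""
    probs = small(text)
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, "small"
    return large(text), "large"


# Stub models standing in for a fine-tuned PLM and an LLM API call.
def stub_small(text: str) -> Dict[str, float]:
    if "surged" in text:
        return {"positive": 0.95, "neutral": 0.04, "negative": 0.01}
    return {"positive": 0.40, "neutral": 0.35, "negative": 0.25}


def stub_large(text: str) -> str:
    return "neutral"


print(route("Shares surged after earnings", stub_small, stub_large))
print(route("Guidance was mixed", stub_small, stub_large))
```

The threshold is the knob that trades cost against accuracy; it would need tuning on held-out data rather than being fixed at the value assumed here.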

Text classification on financial headlines is solved effectively by mixed-domain models and instruction-tuned LLMs.

Numbers: FLANG & FinMA-30B Avg F1≈98%; BERT/FinBERT≈97%

Practical Use: For headline classification, prefer task-finetuned PLMs or small FinLLMs to save cost without much accuracy loss.

Evidence Ref: Section 4.2; Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sentiment analysis F1	92% (FLANG-ELECTRA)	FinMA-30B / GPT-4 ≈87% (5-shot)	≈+5pp over FinMA/GPT-4	Financial PhraseBank (FPB)	Section 4.1	4.1
Headline classification Avg F1	≈98% (FLANG, FinMA-30B)	BERT/FinBERT ≈97%	≈+1pp	Headline FIN	Section 4.2; Table 1	4.2

What To Try In 7 Days

Run FPB and FiQA-SA tests comparing a fine-tuned FinBERT vs GPT-4 to measure cost vs accuracy.

Prototype a RAG pipeline over internal documents to ground LLM answers and reduce hallucination.

Fine-tune a small LLaMA-based FinLLM with LoRA/PEFT on one task (sentiment or headline classification).
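
For the first item above, a small harness that scores each model's predictions and lists cost next to accuracy may be enough to start. The sketch below assumes macro-averaged F1 (the benchmark papers may report weighted or micro F1 instead) and a hypothetical `cost_per_1k` field that you would fill in from your own billing or GPU estimates. The function names are illustrative.

```python
from typing import Dict, List, Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    labels = sorted(set(gold) | set(pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare(gold: Sequence[str], runs: Dict[str, dict]) -> List[dict]:
    """runs maps a model name to {'pred': [...], 'cost_per_1k': float}."""
    rows = [
        {
            "model": name,
            "macro_f1": round(macro_f1(gold, run["pred"]), 3),
            "cost_per_1k": run["cost_per_1k"],  # assumed, user-supplied figure
        }
        for name, run in runs.items()
    ]
    return sorted(rows, key=lambda r: r["macro_f1"], reverse=True)
```

If scikit-learn is available, `sklearn.metrics.f1_score(gold, pred, average="macro")` computes the same quantity and is the safer choice for real evaluations.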

Optimization Features

Token Efficiency

Few-shot prompting

Model Optimization

LoRA, PEFT
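
To see why LoRA is listed as a model optimization: it freezes a weight matrix of shape d_out × d_in and trains only a low-rank update BA, with B of shape d_out × r and A of shape r × d_in. The arithmetic below is a sketch of the resulting parameter savings. The 4096 × 4096 layer size and rank 8 are illustrative assumptions, not figures from the survey.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


# Illustrative layer: a 4096 x 4096 projection adapted with rank 8.
trainable = lora_trainable_params(4096, 4096, 8)
total = full_params(4096, 4096)
print(trainable)                   # 65536
print(f"{trainable / total:.4%}")  # 0.3906%
```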

System Optimization

LLMOps with CI/CD

Training Optimization

Instruction fine-tuning, continual pre-training, mixed-domain pre-training

Inference Optimization

Prompt engineering (few-shot); use smaller finetuned PLMs for cheap inference
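
As a concrete picture of few-shot prompting for sentiment, the sketch below assembles a prompt from labeled examples followed by the query. The template wording, the label set, and the example sentences are all hypothetical; the survey does not specify a prompt format.

```python
from typing import List, Sequence, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    labels: Sequence[str] = ("positive", "neutral", "negative"),
) -> str:
    """Build a k-shot classification prompt; k is len(examples)."""
    header = ("Classify the sentiment of each financial sentence as "
              + ", ".join(labels) + ".")
    shots = [f"Sentence: {text}\nSentiment: {label}" for text, label in examples]
    # The final block leaves the label blank for the model to complete.
    return "\n\n".join([header, *shots, f"Sentence: {query}\nSentiment:"])


# Made-up examples for illustration only.
demo = [
    ("Operating profit rose compared with last year.", "positive"),
    ("The company will publish its report on Friday.", "neutral"),
]
print(build_few_shot_prompt(demo, "Net sales fell short of guidance."))
```

Each added example lengthens every request, so the number of shots is a direct cost lever as well as an accuracy lever.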

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/adlnlp/FinLLMs https://github.com/chancefocus/PIXIU https://github.com/AI4Finance-Foundation/FinGPT

Data URLs

https://github.com/adlnlp/FinLLMs

Risks & Boundaries

Limitations

Many evaluations use standard NLP metrics that miss finance-specific costs and risks.

Key datasets are limited in size, modality, or public availability (proprietary data like Bloomberg).

When Not To Use

High-stakes numeric decision-making without human verification (e.g., regulatory filings).

Real-time trading systems that lack rigorous backtesting and risk controls.

Failure Modes

Hallucination of facts or numbers not grounded in source documents.

Poor numerical reasoning on financial tables and reports.
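
A cheap guard against the first failure mode is to check that every number in a generated answer also appears in the source text it was supposed to be grounded in. The sketch below is a minimal heuristic under that assumption; `ungrounded_numbers` is a hypothetical helper. It compares digit strings only, so it will not catch unit changes, rounding, or wrongly derived values.

```python
import re
from typing import List, Set

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def _numbers(text: str) -> Set[str]:
    """Extract numbers as strings, with thousands separators removed."""
    return {m.replace(",", "") for m in _NUMBER.findall(text)}


def ungrounded_numbers(answer: str, sources: List[str]) -> List[str]:
    """Return numbers that appear in the answer but in none of the sources."""
    source_numbers: Set[str] = set()
    for source in sources:
        source_numbers |= _numbers(source)
    return sorted(_numbers(answer) - source_numbers)


# Made-up texts for illustration only.
source = "Revenue rose 12% to 4.2 billion in 2023."
print(ungrounded_numbers("Revenue rose 12% to 4.5 billion", [source]))
```

A non-empty result is a signal to route the answer to human review, in line with the human-verification advice above.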

Core Entities

Models

FinBERT-19, FinBERT-20, FinBERT-21, FLANG, BloombergGPT, FinMA, InvestLM, FinGPT, GPT-4, ChatGPT, LLaMA, BLOOM

Metrics

F1, Accuracy, EM (Exact Match), ROUGE-1, BERTScore, Sharpe ratio

Datasets

Financial PhraseBank (FPB), FiQA-SA, FinQA, ConvFinQA, StockNet, CIKM18, BigData22, ECTSum, FinSBD, FinRED, FiNER-139, FinTabNet, SemEval-2017, StockEmotions, Headline FIN

Benchmarks

FLUE, FinQA, ConvFinQA, FPB, ECTSum

Context Entities

Models

BERT, ELECTRA, GPT-3, GPT-3.5

Metrics

ROUGE-2, ROUGE-L

Datasets

FinPile, SEC EDGAR, Seeking Alpha, Reuters, Investopedia

Benchmarks

FiQA-QA, FedNLP, FOMC
