FinGPT: instruction-tuning benchmark that evaluates six open-source LLMs on core financial NLP tasks

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible for small teams; results are empirical on several public datasets and six 7B models, but conclusions are limited to this model size and the reformulated classification settings.

Citations: 8

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Neng Wang, Hongyang Yang, Christina Dan Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can adapt open-source 7B LLMs to finance tasks cheaply and reproducibly; choose models per task (Llama2 for general use, BLOOM for extraction, chat models for zero-shot).

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

This paper introduces the FinGPT Benchmark, a practical instruction-tuning benchmark for finance tasks, and tests six open-source LLMs of roughly 7B parameters (Llama2, Falcon, MPT, BLOOM, ChatGLM2, Qwen). The authors use LoRA (rank 8) to fine-tune the models cheaply on sentiment analysis, headline classification, NER, and relation extraction. Key takeaways: Llama2 ranks best overall; multi-task tuning strongly improves relation extraction (Llama2 F1 rises from 0.395 to 0.674); multi-task tuning slightly hurts some classification tasks; and chat-style models such as ChatGLM2 generalize better in zero-shot tests. Code and data are public, and the total reported tuning cost was $302.4 on 4× RTX3090 GPUs.
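To make the "cheap" part of LoRA concrete, here is a minimal sketch of the trainable-parameter arithmetic for one adapted weight matrix at rank 8. The 4096 × 4096 layer shape is an assumption chosen to resemble a 7B-class attention projection; the summary does not say which layers were adapted, so this is an illustration and not the authors' configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix being adapted."""
    return d_in * d_out


# Assumption: a square 4096 x 4096 projection, typical of 7B-class models.
HIDDEN = 4096
RANK = 8  # rank reported in the summary

lora = lora_trainable_params(HIDDEN, HIDDEN, RANK)
full = full_matrix_params(HIDDEN, HIDDEN)
ratio = lora / full

print(f"LoRA params per matrix: {lora:,}")
print(f"Dense params per matrix: {full:,}")
print(f"Trainable fraction: {ratio:.2%}")
```

Under these assumptions each adapted matrix trains well under 1% of its dense parameter count, which is why four consumer GPUs are enough for the fine-tuning runs.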

Problem Statement

Open-source LLMs are promising for finance but there is no shared, reproducible instruction-tuning benchmark to compare models across standard financial NLP tasks. Practitioners need a low-cost, repeatable pipeline to evaluate task-specific, multi-task, and zero-shot performance on finance datasets.

Main Contribution

A three-phase instruction-tuning pipeline for finance: task-specific, multi-task, and zero-shot.

A cost-aware benchmark that fine-tunes six 7B open-source LLMs on sentiment, headline classification, NER, and relation extraction, with public code.

Key Findings

Llama2 had the best overall ranking across tasks.

Numbers: Avg ranking = 2.0 across SA, NER, HC, RE (Table 2)

Practical Use: Start with Llama2 when you need a reliable, general-purpose open-source base for finance tasks.

Evidence Ref: Table 2
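An average ranking of this kind is usually the mean of a model's rank position on each task. The sketch below assumes that definition (the summary does not spell out the paper's exact procedure) and uses only the sentiment-analysis averages that this summary quotes from Table 3, since per-model scores for the other tasks are not listed here. The function names are illustrative.

```python
from statistics import mean


def rank_models(scores: dict[str, float]) -> dict[str, int]:
    """Rank models on one task, 1 = best (higher F1 is better)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def average_rank(per_task_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Mean rank of each model across all tasks supplied."""
    per_task_ranks = [rank_models(s) for s in per_task_scores.values()]
    models = per_task_ranks[0].keys()
    return {m: mean(ranks[m] for ranks in per_task_ranks) for m in models}


# Task-specific average F1 for sentiment analysis, as quoted from Table 3.
sa_scores = {
    "Llama2": 0.820,
    "MPT": 0.821,
    "Qwen": 0.811,
    "Falcon": 0.804,
    "ChatGLM2": 0.798,
    "BLOOM": 0.748,
}

sa_ranks = rank_models(sa_scores)
print(sa_ranks)

# With only one task supplied, the average rank equals the SA rank.
print(average_rank({"SA": sa_scores}))
```

To reproduce the overall figure, pass one score dictionary per task (SA, NER, HC, RE) taken from the paper's tables.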

Multi-task instruction tuning produced large gains for relation extraction.

Numbers: Llama2 RE F1 0.395 → 0.674 (+0.279 absolute; reported as +27.2%, Table 4)

Practical Use: Combine related tasks in multi-task tuning to markedly improve relation extraction performance.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sentiment Analysis (task-specific average F1)	Llama2 0.820, MPT 0.821, Qwen 0.811, Falcon 0.804, ChatGLM2 0.798, BLOOM 0.748	—	—	Average over FPB, FiQA, TFNS, NWGI (Table 3)	Table 3 reports task-specific average F1 per model	Table 3
Relation Extraction F1 (Llama2)	Task-specific 0.395 → Multi-task 0.674	task-specific 0.395	+0.279 absolute (+27.2% reported)	FinRED (Table 4)	Multi-task RE improved substantially for most models	Table 4
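Because the relation-extraction row gives both an absolute delta and a reported percentage, it helps to recompute the gain directly from the two F1 values. This small sketch derives the absolute and relative improvement from the figures in the table; it makes no claim about how the reported percentage was calculated.

```python
def f1_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in F1, relative gain as a fraction of the baseline)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Llama2 relation extraction F1 on FinRED, task-specific vs multi-task (Table 4).
abs_gain, rel_gain = f1_gain(0.395, 0.674)

print(f"{abs_gain:+.3f} F1 ({abs_gain * 100:.1f} points), {rel_gain:.1%} relative")
# prints: +0.279 F1 (27.9 points), 70.6% relative
```

When comparing your own runs, state which of the two conventions you are using, since they differ widely at low baselines.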

What To Try In 7 Days

Clone the FinGPT repo and run the provided instruction-tuning recipe on one 7B model with LoRA.

Run task-specific tuning on your smallest labeled dataset (start with sentiment or headline classification).

Try multi-task tuning if you care about relation extraction; compare RE F1 before/after multi-tasking.
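Before running any of the steps above, labeled data has to be cast into instruction, input, and answer triples. The sketch below shows one way to do that for sentiment analysis. The "Instruction / Input / Answer" template, the instruction wording, and the sample headline are all assumptions for illustration; check the repository's data-preparation scripts for the format it actually expects.

```python
# Assumed instruction wording; not taken from the summary.
SENTIMENT_INSTRUCTION = (
    "What is the sentiment of this news? "
    "Please choose an answer from {negative/neutral/positive}."
)


def build_example(instruction: str, text: str, answer: str = "") -> dict[str, str]:
    """Format one labeled sample as a prompt/target pair for instruction tuning."""
    prompt = f"Instruction: {instruction}\nInput: {text}\nAnswer: "
    return {"prompt": prompt, "target": answer}


# Hypothetical headline and label, for illustration only.
example = build_example(
    SENTIMENT_INSTRUCTION,
    "Company X reported quarterly revenue above analyst expectations.",
    "positive",
)

print(example["prompt"] + example["target"])
```

Leaving `answer` empty produces the same prompt for inference, so training and evaluation share one formatting path.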

Agent Features

Tool Use

LoRA

Architectures

transformer

Optimization Features

Token Efficiency

max token length = 512

Infra Optimization

4× RTX3090 GPUs reported

Model Optimization

LoRA

FP16 inference/training

System Optimization

Checkpoint selection by SA evaluation loss

Training Optimization

Gradient accumulation (8 steps)

AdamW optimizer with linear decay and 3% warmup
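The training settings above can be sketched as two small functions: a linear warmup followed by linear decay, and the effective batch size implied by gradient accumulation. The base learning rate, total step count, and per-device batch size are placeholders (the summary does not report them), and the batch-size formula assumes data-parallel training across the four GPUs.

```python
def linear_warmup_decay(
    step: int, total_steps: int, base_lr: float, warmup_ratio: float = 0.03
) -> float:
    """Learning rate at `step`: linear ramp-up over the warmup, then linear decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


def effective_batch_size(
    per_device_batch: int, num_gpus: int = 4, grad_accum_steps: int = 8
) -> int:
    """Samples contributing to each optimizer update under data parallelism."""
    return per_device_batch * num_gpus * grad_accum_steps


# Placeholder values, not taken from the paper.
TOTAL_STEPS = 1000
BASE_LR = 1e-4

for step in (0, 15, 30, 515, 1000):
    print(step, linear_warmup_decay(step, TOTAL_STEPS, BASE_LR))

print(effective_batch_size(per_device_batch=4))
```

With 3% warmup over 1000 steps, the rate peaks at step 30 and then falls linearly to zero at the final step.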

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/AI4Finance-Foundation/FinGPT/tree/master/fingpt/FinGPT_Benchmark

Data URLs

https://github.com/AI4Finance-Foundation/FinGPT

FLUE benchmark (cited)

Risks & Boundaries

Limitations

Experiments cover only ~7B models; results may not hold for much larger models.

NER dataset is small (609 / 1003 samples), so NER results are noisy.

When Not To Use

For high-stakes finance tasks that need calibrated probabilities and strict factual recall.

When you require span-level NER outputs (paper converts NER to classification).
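The span-level caveat is easier to see with an example of what "NER as classification" looks like. In the sketch below, each known entity mention becomes its own classification question, so the model only assigns a type and never locates span boundaries. The question wording, label set, and sample sentence are hypothetical and not taken from the paper.

```python
def ner_to_classification(
    sentence: str,
    entities: list[tuple[str, str]],
    labels: tuple[str, ...] = ("person", "location", "organization"),
) -> list[dict[str, str]]:
    """Turn (mention, type) annotations into one classification example per mention."""
    options = "/".join(labels)
    examples = []
    for mention, entity_type in entities:
        instruction = (
            f"What is the entity type of '{mention}' in the input sentence? "
            f"Options: {options}."
        )
        examples.append(
            {"instruction": instruction, "input": sentence, "answer": entity_type}
        )
    return examples


# Hypothetical annotated sentence, for illustration only.
samples = ner_to_classification(
    "Jane Doe signed the loan agreement with Example Bank.",
    [("Jane Doe", "person"), ("Example Bank", "organization")],
)

for sample in samples:
    print(sample["instruction"], "->", sample["answer"])
```

Because the mention text is supplied in the question, a pipeline that needs to discover entities in raw text still requires a separate span detector.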

Failure Modes

Hallucination on unseen tasks or when instruction diversity is low.

Task interference in multi-task settings that can hurt classification accuracy.

Core Entities

Models

Llama2-7B, Falcon-7B, MPT-7B, BLOOM-7.1B, ChatGLM2-6B, Qwen-7B

Metrics

F1-score, entity-level F1, relation-only F1

Datasets

FPB, FiQA-SA, TFNS, NWGI, Headline, NER (domain), FinRED, FLUE

Benchmarks

FinGPT Benchmark (this paper), FLUE

