FinGPT: instruction-tuning benchmark that evaluates six open-source LLMs on core financial NLP tasks

October 7, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible for small teams; results are empirical on several public datasets and six 7B models, but conclusions are limited to this model size and the reformulated classification settings.

Citations: 8

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Neng Wang, Hongyang Yang, Christina Dan Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can adapt open-source 7B LLMs to finance tasks cheaply and reproducibly; choose models per task (Llama2 for general use, BLOOM for extraction, chat models for zero-shot).

Who Should Care

Summary TLDR

This paper builds a practical instruction-tuning benchmark (FinGPT Benchmark) for finance tasks and tests six open-source LLMs of roughly 7B parameters (Llama2, Falcon, MPT, BLOOM, ChatGLM2, Qwen). The authors use LoRA (rank 8) to fine-tune the models cheaply on sentiment analysis, headline classification, NER, and relation extraction. Key takeaways: Llama2 ranks best overall; multi-task tuning strongly improves relation extraction (Llama2 F1 rises from 0.395 to 0.674); multi-task tuning slightly hurts some classification tasks; and chat-style models (ChatGLM2) generalize better in zero-shot tests. Code and data are public, and the total reported tuning cost was $302.4 on 4x RTX3090 GPUs.
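To give a feel for why rank-8 LoRA is cheap, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. The 4096 x 4096 shape is an assumption (a typical hidden size for a 7B-class model), not a figure from the paper, and the helper name is hypothetical.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumption: one square 4096 x 4096 projection matrix; real models differ per layer.
d = 4096
full_params = d * d
lora_params = lora_trainable_params(d, d, rank=8)  # rank 8 as stated in the summary

print(f"full matrix: {full_params:,} parameters")
print(f"LoRA rank 8: {lora_params:,} parameters")
print(f"fraction trained: {lora_params / full_params:.4%}")
```

Under these assumptions the adapter trains well under 1% of the matrix's parameters, which is consistent with the low tuning cost reported here.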

Problem Statement

Open-source LLMs are promising for finance but there is no shared, reproducible instruction-tuning benchmark to compare models across standard financial NLP tasks. Practitioners need a low-cost, repeatable pipeline to evaluate task-specific, multi-task, and zero-shot performance on finance datasets.

Main Contribution

A three-phase instruction-tuning pipeline for finance: task-specific, multi-task, and zero-shot.

A cost-aware benchmark that fine-tunes six 7B open-source LLMs on sentiment, headline classification, NER, and relation extraction, with public code.

Key Findings

Llama2 had the best overall ranking across tasks.

Numbers: Avg ranking = 2.0 across SA, NER, HC, RE (Table 2)

Practical Use: Start with Llama2 when you need a reliable, general-purpose open-source base for finance tasks.

Evidence Ref: Table 2

Multi-task instruction tuning produced large gains for relation extraction.

Numbers: Llama2 RE F1: 0.395 → 0.674 (+0.279 absolute; reported as +27.2%, Table 4)

Practical Use: Combine related tasks in multi-task tuning to markedly improve relation extraction performance.

Evidence Ref: Table 4
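The gain can be recomputed from the two reported F1 endpoints. The sketch below derives both the absolute and the relative change; the function name is illustrative. Neither result equals the +27.2% quoted above, so that figure is worth checking against Table 4 of the paper.

```python
def f1_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) between two F1 scores."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Llama2 relation-extraction F1: task-specific vs. multi-task (values from this summary).
abs_gain, rel_gain = f1_gain(0.395, 0.674)

print(f"absolute gain: {abs_gain:+.3f} F1 ({abs_gain * 100:.1f} points)")
print(f"relative gain: {rel_gain:+.1%}")
```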

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Sentiment Analysis (task-specific average F1) | Llama2 0.820, MPT 0.821, Qwen 0.811, Falcon 0.804, ChatGLM2 0.798, BLOOM 0.748 | n/a | n/a | Average over FPB, FiQA, TFNS, NWGI (Table 3) | Table 3 reports task-specific average F1 per model | Table 3 |
| Relation Extraction F1 (Llama2) | Task-specific 0.395 → Multi-task 0.674 | Task-specific 0.395 | +0.279 absolute (+27.2% reported) | FinRED (Table 4) | Multi-task RE improved substantially for most models | Table 4 |
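As a small illustration of how a per-task ranking can be derived from such scores, the sketch below ranks the six models on the sentiment-analysis averages listed above. It covers one task only; the paper's overall ranking averages ranks across all four tasks, and those per-task ranks are not reproduced here. Ties are not handled.

```python
# Task-specific average F1 for sentiment analysis, as listed in the results above.
sa_f1 = {
    "Llama2": 0.820,
    "MPT": 0.821,
    "Qwen": 0.811,
    "Falcon": 0.804,
    "ChatGLM2": 0.798,
    "BLOOM": 0.748,
}


def rank_models(scores: dict[str, float]) -> dict[str, int]:
    """Rank models by score, highest first (rank 1 = best). Ties are not handled."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


sa_ranks = rank_models(sa_f1)
for model, rank in sorted(sa_ranks.items(), key=lambda item: item[1]):
    print(f"{rank}. {model} ({sa_f1[model]:.3f})")
```

On this single task MPT edges out Llama2 by 0.001, which shows why the overall recommendation rests on the average rank across tasks rather than on any one score.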

What To Try In 7 Days

Clone the FinGPT repo and run the provided instruction-tuning recipe on one 7B model with LoRA.

Run task-specific tuning on your smallest labeled dataset (start with sentiment or headline classification).

Try multi-task tuning if you care about relation extraction; compare RE F1 before/after multi-tasking.

Agent Features

Tool Use
LoRA
Architectures
transformer

Optimization Features

Token Efficiency
max token length = 512
Infra Optimization
4× RTX3090 GPUs reported
Model Optimization
LoRA; FP16 inference/training
System Optimization
Checkpoint selection by SA evaluation loss
Training Optimization
Gradient accumulation (8 steps); AdamW optimizer with linear decay and 3% warmup
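The schedule described here can be sketched in plain Python. This is a minimal illustration of linear warmup followed by linear decay, assuming the 3% warmup is a fraction of total optimizer steps. The micro-batch count is hypothetical and the function is not taken from the FinGPT code.

```python
def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.03) -> float:
    """Scale factor applied to the base learning rate at a given optimizer step."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / decay_steps)


# Gradient accumulation: one optimizer step per 8 micro-batches.
accumulation_steps = 8
micro_batches = 8000  # hypothetical number of forward/backward passes
optimizer_steps = micro_batches // accumulation_steps

for step in (0, 15, 30, 515, 1000):
    print(step, lr_multiplier(step, optimizer_steps))
```

With gradient accumulation, the schedule advances once per optimizer step rather than once per micro-batch, so the warmup length depends on the accumulated step count.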

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Experiments cover only ~7B models; results may not hold for much larger models.

NER dataset is small (609 / 1003 samples), so NER results are noisy.

When Not To Use

For high-stakes finance tasks that need calibrated probabilities and strict factual recall.

When you require span-level NER outputs (paper converts NER to classification).
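To make the span-level caveat concrete, the sketch below contrasts a span annotation with a classification-style instruction prompt. The prompt wording, label set, and example sentence are all assumptions for illustration, not the paper's actual template or data.

```python
def ner_as_classification(sentence: str, entity: str,
                          labels: tuple[str, ...] = ("person", "organization", "location")) -> str:
    """Build a classification-style prompt asking for the type of one given entity.

    Illustrative format only; the paper's real prompt template may differ.
    """
    options = "/".join(labels)
    return (
        f"Instruction: What is the entity type of '{entity}' in the input sentence? "
        f"Options: {options}\n"
        f"Input: {sentence}\n"
        "Answer: "
    )


# Hypothetical example sentence and entity.
sentence = "Acme Bank agreed to lend the borrower 5 million dollars."
entity = "Acme Bank"

# A span-level annotation keeps character offsets alongside the label...
start = sentence.index(entity)
span_annotation = (start, start + len(entity), "organization")

# ...while the classification framing only asks for a label for a known entity.
prompt = ner_as_classification(sentence, entity)
print(span_annotation)
print(prompt)
```

In the classification framing the entity is supplied in the prompt, so the model never has to locate it. A pipeline that needs offsets would have to add a separate span-detection step.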

Failure Modes

Hallucination on unseen tasks or when instruction diversity is low.

Task interference in multi-task settings that can hurt classification accuracy.

Core Entities

Models

Llama2-7B, Falcon-7B, MPT-7B, BLOOM-7.1B, ChatGLM2-6B, Qwen-7B

Metrics

F1-score, entity-level F1, relation-only F1

Datasets

FPB, FiQA-SA, TFNS, NWGI, Headline, NER (domain), FinRED, FLUE

Benchmarks

FinGPT Benchmark (this paper), FLUE