FinGPT: instruction-tuning benchmark that evaluates six open-source LLMs on core financial NLP tasks

October 7, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

8

Authors

Neng Wang, Hongyang Yang, Christina Dan Wang

Links

Abstract / PDF

Why It Matters For Business

You can adapt open-source 7B LLMs to finance tasks cheaply and reproducibly; choose models per task (Llama2 for general use, BLOOM for extraction, chat models for zero-shot).

Summary TLDR

This paper builds a practical instruction-tuning benchmark (FinGPT Benchmark) for finance tasks and tests six open-source LLMs of roughly 7B parameters (Llama2, Falcon, MPT, BLOOM, ChatGLM2, Qwen). The authors use LoRA (rank 8) to fine-tune the models cheaply on sentiment analysis, headline classification, NER, and relation extraction. Key takeaways: Llama2 ranks best overall; multi-task tuning gives large F1 gains on relation extraction but slightly hurts some classification tasks; and chat-style models (ChatGLM2) generalize better in zero-shot tests. Code and data are public, and the total reported tuning cost was $302.4 on 4× RTX3090 GPUs.

Problem Statement

Open-source LLMs are promising for finance but there is no shared, reproducible instruction-tuning benchmark to compare models across standard financial NLP tasks. Practitioners need a low-cost, repeatable pipeline to evaluate task-specific, multi-task, and zero-shot performance on finance datasets.

Main Contribution

A three-phase instruction-tuning pipeline for finance: task-specific, multi-task, and zero-shot.

A cost-aware benchmark that fine-tunes six 7B open-source LLMs on sentiment, headline classification, NER, and relation extraction, with public code.

Empirical comparison showing model strengths vary by task (e.g., Llama2 strong overall; BLOOM strong on information extraction).

Practical recipes: LoRA setup, training schedules, checkpoint selection and reported compute cost.
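To make the low-cost side of the LoRA recipe concrete, here is a minimal sketch of how few parameters a rank-8 adapter trains on a single weight matrix. The 4096 × 4096 shape matches an attention projection in Llama2-7B; which matrices the paper actually adapts is not stated in this summary, so the choice of layer is an assumption, and the function name is illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumption: a square 4096 x 4096 projection, as in Llama2-7B attention layers.
d_in = d_out = 4096
rank = 8  # LoRA rank reported in the summary

full = d_in * d_out                               # parameters in the frozen matrix
lora = lora_trainable_params(d_in, d_out, rank)   # parameters actually trained
fraction = lora / full

print(f"full matrix:  {full:,} params")
print(f"LoRA factors: {lora:,} params ({fraction:.2%} of the full matrix)")
```

Under these assumptions the adapter trains well under 1% of the parameters of the matrix it modifies, which is what keeps memory use and cost low on consumer GPUs.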

Key Findings

Llama2 had the best overall ranking across tasks.

Numbers: Avg ranking = 2.0 across SA, NER, HC, RE (Table 2)

Multi-task instruction tuning produced large gains for relation extraction.

Numbers: Llama2 RE: 0.395 → 0.674 (+0.279 absolute F1, about 27.9 points, Table 4)

Multi-task tuning slightly reduced classification performance on average.

Numbers: SA average change (Llama2) −1.3% after multi-task tuning (Table 3)

Chat-style model generalized best in zero-shot sentiment tests.

Numbers: Zero-shot FPB F1: ChatGLM2 = 0.803 (Table 5)

The reported cost of instruction tuning was modest.

Numbers: Total training time ≈ 90 GPU-hours; cost = $302.4 on 4× RTX3090 (Section 4.3)
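As a rough budgeting aid, the two cost figures above can be turned into an implied hourly rate and scaled to a different run length. This is back-of-envelope arithmetic only: it assumes cost scales linearly with GPU-hours, and the 30-hour run and the function name are hypothetical.

```python
REPORTED_COST_USD = 302.4    # total cost reported in the summary
REPORTED_GPU_HOURS = 90.0    # approximate total training time reported in the summary


def estimate_cost(gpu_hours: float) -> float:
    """Scale the reported cost linearly to a different number of GPU-hours."""
    rate = REPORTED_COST_USD / REPORTED_GPU_HOURS
    return rate * gpu_hours


implied_rate = REPORTED_COST_USD / REPORTED_GPU_HOURS
print(f"implied rate: ${implied_rate:.2f} per GPU-hour")
print(f"hypothetical 30 GPU-hour run: ${estimate_cost(30):.2f}")
```

Because the reported training time is itself approximate, treat the implied rate as an order-of-magnitude guide rather than a quote.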

Results

Sentiment Analysis (task-specific average F1)

Value: Llama2 0.820, MPT 0.821, Qwen 0.811, Falcon 0.804, ChatGLM2 0.798, BLOOM 0.748

Relation Extraction F1 (Llama2)

Value: Task-specific 0.395 → Multi-task 0.674

Baseline: Task-specific 0.395

Zero-shot Sentiment F1 (FPB)

Value: ChatGLM2 0.803, Falcon 0.791, Llama2 0.621

Total reported training cost

Value: $302.4

Who Should Care

What To Try In 7 Days

Clone the FinGPT repo and run the provided instruction-tuning recipe on one 7B model with LoRA.

Run task-specific tuning on your smallest labeled dataset (start with sentiment or headline classification).

Try multi-task tuning if you care about relation extraction; compare RE F1 before/after multi-tasking.
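Before running the recipe, it helps to see what an instruction-formatted training example might look like. The sketch below builds one prompt for sentiment analysis. The template wording, the field names, and the `build_example` helper are hypothetical and are not taken from the FinGPT repo, so check the repo for the exact format it expects.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    prompt: str   # text fed to the model
    answer: str   # target text the model is tuned to produce


# Hypothetical template; the actual FinGPT prompt wording may differ.
TEMPLATE = "Instruction: {instruction}\nInput: {text}\nAnswer: "
LABELS = ("negative", "neutral", "positive")


def build_example(text: str, label: str) -> InstructionExample:
    """Wrap one labeled sentence in an instruction-style prompt."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    instruction = (
        "What is the sentiment of this news? "
        f"Choose one of: {', '.join(LABELS)}."
    )
    return InstructionExample(
        prompt=TEMPLATE.format(instruction=instruction, text=text.strip()),
        answer=label,
    )


# Invented sentence for illustration; not drawn from any benchmark dataset.
example = build_example("  Quarterly revenue rose 12% year over year. ", "positive")
print(example.prompt + example.answer)
```

Keeping the template in one place makes it easy to reuse the same wording across task-specific and multi-task runs, so that differences in F1 reflect the training setup rather than the prompt.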

Agent Features

Tool Use

  • LoRA

Architectures

  • transformer

Optimization Features

Token Efficiency

  • max token length = 512

Infra Optimization

  • 4× RTX3090 GPUs reported

Model Optimization

  • LoRA
  • FP16 inference/training

System Optimization

  • Checkpoint selection by SA evaluation loss
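Selecting a checkpoint by evaluation loss amounts to taking the minimum over logged values. A minimal sketch, with made-up step numbers and losses standing in for real training logs and a hypothetical helper name:

```python
# Made-up (step, SA evaluation loss) pairs; replace with values from your own logs.
eval_log = [(200, 0.412), (400, 0.371), (600, 0.358), (800, 0.366)]


def select_checkpoint(log: list[tuple[int, float]]) -> int:
    """Return the step with the lowest evaluation loss (first logged entry wins ties)."""
    if not log:
        raise ValueError("empty evaluation log")
    best_step, _ = min(log, key=lambda item: item[1])
    return best_step


best = select_checkpoint(eval_log)
print("best checkpoint step:", best)
```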

Training Optimization

  • Gradient accumulation (8 steps)
  • AdamW optimizer with linear decay and 3% warmup
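The schedule listed above (linear warmup over the first 3% of steps, then linear decay) can be sketched in plain Python. The total step count, peak learning rate, and per-device batch size are placeholder assumptions, not values from the paper; only the 3% warmup, the 8 accumulation steps, and the 4 GPUs come from this summary. The effective batch size also assumes data-parallel training across the GPUs.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Placeholder values for illustration only.
TOTAL_STEPS = 1000
PEAK_LR = 1e-4
PER_DEVICE_BATCH = 4

NUM_GPUS = 4     # from the summary
GRAD_ACCUM = 8   # from the summary

# Assumes data parallelism: each GPU processes its own micro-batches.
effective_batch = PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM

for s in (0, 15, 30, 515, 1000):
    print(f"step {s:>4}: lr = {lr_at_step(s, TOTAL_STEPS, PEAK_LR):.2e}")
print("effective batch size:", effective_batch)
```

Gradient accumulation lets a memory-limited GPU reach a larger effective batch: the optimizer steps once per 8 micro-batches instead of once per micro-batch.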

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments cover only ~7B models; results may not hold for much larger models.
  • NER dataset is small (609 / 1003 samples), so NER results are noisy.
  • Zero-shot tests exclude neutral labels, which makes them less representative of real-world sentiment tasks.
  • NER and RE were reformulated into classification (NER(CLS), RE(CLS)), which reduces realism for span extraction tasks.

When Not To Use

  • For high-stakes finance tasks that need calibrated probabilities and strict factual recall.
  • When you require span-level NER outputs (paper converts NER to classification).
  • If you must support neutral sentiment in zero-shot scenarios (neutral excluded).

Failure Modes

  • Hallucination on unseen tasks or when instruction diversity is low.
  • Task interference in multi-task settings that can hurt classification accuracy.
  • Bias from small or unbalanced financial datasets causing overfitting.

Core Entities

Models

  • Llama2-7B
  • Falcon-7B
  • MPT-7B
  • BLOOM-7.1B
  • ChatGLM2-6B
  • Qwen-7B

Metrics

  • F1-score
  • entity-level F1
  • relation-only F1
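For reference, F1 over extracted items (entities or relation triples) can be computed from the overlap between predicted and gold sets. This is a generic exact-match sketch, not the paper's evaluation script; the example pairs are invented, and the matching rules the benchmark uses for entity-level and relation-only F1 may differ.

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 from exact-match overlap between predicted and gold items."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented (entity, type) pairs for illustration.
gold = {("Acme Corp", "organization"), ("Jane Doe", "person"), ("London", "location")}
pred = {("Acme Corp", "organization"), ("Jane Doe", "organization")}

score = f1_score(pred, gold)
print(f"entity-level F1: {score:.3f}")
```

Here one of two predictions is correct (precision 1/2) and one of three gold entities is found (recall 1/3), so a mistyped entity counts against both precision and recall.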

Datasets

  • FPB
  • FiQA-SA
  • TFNS
  • NWGI
  • Headline
  • NER (domain)
  • FinRED
  • FLUE

Benchmarks

  • FinGPT Benchmark (this paper)
  • FLUE