Overview
The benchmark is practical and reproducible for small teams; results are empirical on several public datasets and six 7B models, but conclusions are limited to this model size and the reformulated classification settings.
Citations: 8
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
You can adapt open-source 7B LLMs to finance tasks cheaply and reproducibly; choose models per task (Llama2 for general use, BLOOM for extraction, chat models for zero-shot).
Summary TLDR
This paper builds a practical instruction-tuning benchmark (FinGPT Benchmark) for finance tasks and tests six 7B open-source LLMs (Llama2, Falcon, MPT, BLOOM, ChatGLM2, Qwen). The authors use LoRA (rank 8) to fine-tune each model cheaply on sentiment analysis, headline classification, NER, and relation extraction. Key takeaways: Llama2 ranks best overall; multi-task tuning strongly improves relation extraction (Llama2 F1 on FinRED rises from 0.395 to 0.674); multi-task tuning slightly hurts some classification tasks; and chat-style models (ChatGLM2) generalize better in zero-shot tests. Code and data are public, and the total reported tuning cost was $302.4 on 4x RTX 3090 GPUs.
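To see why rank-8 LoRA keeps tuning cheap, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. The 4096 x 4096 shape is an illustrative assumption, not a figure from the paper, and which matrices are adapted depends on the tuning recipe.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes W and learns a low-rank update B @ A, where
    B is d_out x rank and A is rank x d_in.
    """
    return rank * (d_out + d_in)


# Assumed shape for illustration only; not taken from the paper.
d_out, d_in, rank = 4096, 4096, 8

full = d_out * d_in
lora = lora_param_count(d_out, d_in, rank)
fraction = lora / full

print(f"full matrix: {full:,} params")
print(f"LoRA r={rank}: {lora:,} params ({fraction:.2%} of the full matrix)")
```

Under these assumed dimensions the adapter trains well under 1% of the parameters in the matrix it modifies, which is what makes a run on consumer GPUs plausible.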
Problem Statement
Open-source LLMs are promising for finance but there is no shared, reproducible instruction-tuning benchmark to compare models across standard financial NLP tasks. Practitioners need a low-cost, repeatable pipeline to evaluate task-specific, multi-task, and zero-shot performance on finance datasets.
Main Contribution
A three-phase instruction-tuning pipeline for finance: task-specific, multi-task, and zero-shot.
A cost-aware benchmark that fine-tunes six 7B open-source LLMs on sentiment, headline classification, NER, and relation extraction, with public code.
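The three phases differ mainly in which tasks feed training and which tasks are evaluated. The following is a minimal sketch of that bookkeeping, not the authors' code: `build_phases` and the unseen task name are hypothetical, and the paper's actual zero-shot tasks are not listed in this summary.

```python
TASKS = ["sentiment", "headline", "ner", "relation_extraction"]


def build_phases(train_tasks, unseen_tasks):
    """Return (phase, train_on, evaluate_on) triples for the three phases.

    unseen_tasks is a hypothetical list of tasks never used in training;
    it stands in for whatever the zero-shot evaluation actually uses.
    """
    phases = []
    # Phase 1: one adapter per task, evaluated on that same task.
    for task in train_tasks:
        phases.append(("task-specific", [task], [task]))
    # Phase 2: a single adapter trained jointly on all tasks.
    phases.append(("multi-task", list(train_tasks), list(train_tasks)))
    # Phase 3: reuse the jointly trained adapter on tasks it never saw.
    phases.append(("zero-shot", list(train_tasks), list(unseen_tasks)))
    return phases


phases = build_phases(TASKS, ["hypothetical_unseen_task"])
for name, train_on, eval_on in phases:
    print(f"{name:13s} train={train_on} eval={eval_on}")
```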
Key Findings
Llama2 had the best overall ranking across tasks.
Multi-task instruction tuning produced large gains for relation extraction (Llama2 F1 on FinRED: 0.395 task-specific vs 0.674 multi-task, Table 4).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Sentiment Analysis (task-specific average F1) | Llama2 0.820, MPT 0.821, Qwen 0.811, Falcon 0.804, ChatGLM2 0.798, BLOOM 0.748 | — | — | Average over FPB, FiQA, TFNS, NWGI (Table 3) | Table 3 reports task-specific average F1 per model | Table 3 |
| Relation Extraction F1 (Llama2) | Task-specific 0.395 → Multi-task 0.674 | task-specific 0.395 | +0.279 absolute (+27.2% reported) | FinRED (Table 4) | Multi-task RE improved substantially for most models | Table 4 |
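The figures in the table can be checked with a few lines of arithmetic. This sketch ranks the models by sentiment F1 and recomputes the relation-extraction delta, using only the values quoted in the two rows above.

```python
# Task-specific average F1 for sentiment analysis, copied from the table above.
sentiment_f1 = {
    "Llama2": 0.820, "MPT": 0.821, "Qwen": 0.811,
    "Falcon": 0.804, "ChatGLM2": 0.798, "BLOOM": 0.748,
}
ranking = sorted(sentiment_f1, key=sentiment_f1.get, reverse=True)

# Llama2 relation-extraction F1 on FinRED, task-specific vs multi-task.
re_task_specific, re_multi_task = 0.395, 0.674
abs_delta = re_multi_task - re_task_specific
rel_delta = abs_delta / re_task_specific

print(ranking)
print(f"absolute: {abs_delta:+.3f}, relative: {rel_delta:+.1%}")
```

On sentiment alone MPT edges out Llama2 by 0.001, so the "best overall" claim for Llama2 refers to the ranking across all tasks rather than to every task individually. The recomputed delta is +0.279 absolute and about +70.6% relative. Neither matches the +27.2% quoted in the Delta column, so that figure presumably comes from a different aggregation and should be checked against Table 4.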
What To Try In 7 Days
Clone the FinGPT repo and run the provided instruction-tuning recipe on one 7B model with LoRA.
Run task-specific tuning on your smallest labeled dataset (start with sentiment or headline classification).
Try multi-task tuning if you care about relation extraction; compare RE F1 before/after multi-tasking.
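For the before/after comparison on relation extraction, a set-based micro F1 over predicted triples is one reasonable scoring choice. This is a sketch under that assumption: the paper's exact scoring for FinRED may differ, and the example triples below are made up.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example sets of (head, relation, tail) triples."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # triples predicted and correct
        fp += len(p - g)   # triples predicted but not in gold
        fn += len(g - p)   # gold triples the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up examples; replace with parsed model outputs on your RE test set.
gold = [{("Apple", "founded_by", "Steve Jobs")}, {("Tesla", "ceo", "Elon Musk")}]
pred_task_specific = [{("Apple", "founded_by", "Steve Jobs")}, set()]
pred_multi_task = [{("Apple", "founded_by", "Steve Jobs")}, {("Tesla", "ceo", "Elon Musk")}]

f1_before = micro_f1(gold, pred_task_specific)
f1_after = micro_f1(gold, pred_multi_task)
print(f"before={f1_before:.3f} after={f1_after:.3f} delta={f1_after - f1_before:+.3f}")
```

Scoring both adapters with the same function on the same test split keeps the comparison fair even if your metric differs from the paper's.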
Risks & Boundaries
Limitations
Experiments cover only ~7B models; results may not hold for much larger models.
NER dataset is small (609 / 1003 samples), so NER results are noisy.
When Not To Use
For high-stakes finance tasks that need calibrated probabilities and strict factual recall.
When you require span-level NER outputs (paper converts NER to classification).
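To illustrate why a classification-style NER setup cannot return spans, the sketch below turns span annotations into one classification example per entity. The prompt wording, label set, and `ner_to_classification` helper are hypothetical and are not taken from the paper.

```python
def ner_to_classification(sentence, entities):
    """Turn span-level NER annotations into one classification example per entity.

    entities is a list of (surface_text, label) pairs. The prompt wording and
    label options are a hypothetical template, not the paper's.
    """
    examples = []
    for text, label in entities:
        examples.append({
            "instruction": "What is the entity type of the mention in the input? "
                           "Options: person, organization, location.",
            "input": f"Sentence: {sentence}\nMention: {text}",
            "output": label,
        })
    return examples


examples = ner_to_classification(
    "Jane Doe signed the agreement with Acme Bank.",  # made-up sentence
    [("Jane Doe", "person"), ("Acme Bank", "organization")],
)
for ex in examples:
    print(ex["input"].replace("\n", " | "), "->", ex["output"])
```

In this form the mention is supplied in the prompt, so the model only picks a type and never has to locate entity boundaries. A pipeline that needs boundaries would require a separate span-detection step.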
Failure Modes
Hallucination on unseen tasks or when instruction diversity is low.
Task interference in multi-task settings that can hurt classification accuracy.

