Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
8
Why It Matters For Business
You can adapt open-source 7B LLMs to finance tasks cheaply and reproducibly; choose models per task (Llama2 for general use, BLOOM for extraction, chat models for zero-shot).
Summary TLDR
This paper builds a practical instruction-tuning benchmark (FinGPT Benchmark) for finance tasks and tests six 7B open-source LLMs (Llama2, Falcon, MPT, BLOOM, ChatGLM2, Qwen). The authors use LoRA (rank 8) to fine-tune the models cheaply on sentiment analysis, headline classification, NER, and relation extraction. Key takeaways: Llama2 ranks best overall; multi-task tuning strongly improves relation extraction (large F1 gains) but slightly hurts some classification tasks; and chat-style models (ChatGLM2) generalize better in zero-shot tests. Code and data are public, and the total reported tuning cost was $302.4 on 4× RTX3090 GPUs.
Problem Statement
Open-source LLMs are promising for finance but there is no shared, reproducible instruction-tuning benchmark to compare models across standard financial NLP tasks. Practitioners need a low-cost, repeatable pipeline to evaluate task-specific, multi-task, and zero-shot performance on finance datasets.
Main Contribution
A three-phase instruction-tuning pipeline for finance: task-specific, multi-task, and zero-shot.
A cost-aware benchmark that fine-tunes six 7B open-source LLMs on sentiment, headline classification, NER, and relation extraction, with public code.
Empirical comparison showing model strengths vary by task (e.g., Llama2 strong overall; BLOOM strong on information extraction).
Practical recipes: LoRA setup, training schedules, checkpoint selection and reported compute cost.
Key Findings
Llama2 had the best overall ranking across tasks.
Multi-task instruction tuning produced large gains for relation extraction.
Multi-task tuning slightly reduced classification performance on average.
The chat-style model (ChatGLM2) generalized best in zero-shot sentiment tests.
Instruction tuning was cheap: the total reported cost was $302.4 on 4× RTX3090 GPUs.
Results
Sentiment Analysis (task-specific average F1)
Relation Extraction F1 (Llama2)
Zero-shot Sentiment F1 (FPB)
Total reported training cost
Who Should Care
What To Try In 7 Days
Clone the FinGPT repo and run the provided instruction-tuning recipe on one 7B model with LoRA.
Run task-specific tuning on your smallest labeled dataset (start with sentiment or headline classification).
Try multi-task tuning if you care about relation extraction; compare RE F1 before/after multi-tasking.
Agent Features
Tool Use
- LoRA
Architectures
- transformer
Optimization Features
Token Efficiency
- max token length = 512
Infra Optimization
- 4× RTX3090 GPUs reported
Model Optimization
- LoRA
- FP16 inference/training
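To illustrate why rank-8 LoRA is cheap, the sketch below counts the trainable parameters one adapter adds to a single square projection matrix. The hidden size of 4096 is an assumption typical of 7B-class models, not a figure from the summary, and the function names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, rank: int = 8) -> int:
    """Parameters added by one LoRA adapter.

    LoRA learns two low-rank matrices: A with shape (rank, d_in)
    and B with shape (d_out, rank), so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix being adapted."""
    return d_in * d_out


hidden = 4096  # ASSUMPTION: hidden size typical of 7B-class models
added = lora_param_count(hidden, hidden, rank=8)
full = full_param_count(hidden, hidden)
ratio = added / full

print(f"LoRA params per adapted matrix: {added}")
print(f"Full matrix params:            {full}")
print(f"Trainable fraction:            {ratio:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters in each adapted matrix, which is consistent with the low tuning cost reported.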
System Optimization
- Checkpoint selection by SA evaluation loss
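Checkpoint selection by evaluation loss can be sketched as picking the saved checkpoint with the lowest loss on a held-out sentiment analysis (SA) set. The checkpoint names and loss values below are made up for illustration, and `select_checkpoint` is a hypothetical helper, not part of the FinGPT code.

```python
from typing import Dict


def select_checkpoint(eval_losses: Dict[str, float]) -> str:
    """Return the name of the checkpoint with the lowest evaluation loss."""
    if not eval_losses:
        raise ValueError("no checkpoints to choose from")
    return min(eval_losses, key=eval_losses.get)


# HYPOTHETICAL values: SA evaluation loss recorded at each saved checkpoint.
sa_eval_losses = {
    "checkpoint-100": 0.62,
    "checkpoint-200": 0.48,
    "checkpoint-300": 0.51,
}

best = select_checkpoint(sa_eval_losses)
print(best)
```

Selecting on a single task's loss is simple, but it may favor sentiment performance over the other tasks in a multi-task run.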
Training Optimization
- Gradient accumulation (8 steps)
- AdamW optimizer with linear decay and 3% warmup
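The schedule described above can be sketched in plain Python: the learning-rate multiplier rises linearly over the first 3% of steps, then decays linearly to zero. The total step count and per-device batch size are assumptions for illustration; only the 3% warmup, the 8 accumulation steps, and the 4 GPUs come from the summary.

```python
def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.03) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def effective_batch_size(per_device_batch: int, num_gpus: int = 4,
                         grad_accum_steps: int = 8) -> int:
    """Examples contributing to each optimizer update."""
    return per_device_batch * num_gpus * grad_accum_steps


total = 1000  # ASSUMPTION: total optimizer steps, chosen for illustration
for step in (0, 15, 30, 500, 1000):
    print(step, round(lr_multiplier(step, total), 4))

# ASSUMPTION: per-device batch size of 1; the summary does not state it.
print(effective_batch_size(per_device_batch=1))
```

Gradient accumulation lets small-memory GPUs such as the RTX3090 reach a larger effective batch size without holding the full batch in memory at once.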
Reproducibility
Data Urls
- https://github.com/AI4Finance-Foundation/FinGPT
- FLUE benchmark (cited)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments cover only ~7B models; results may not hold for much larger models.
- NER dataset is small (609 / 1003 samples), so NER results are noisy.
- Zero-shot tests exclude neutral labels, which makes them less representative of real-world sentiment tasks.
- NER and RE were reformulated into classification (NER(CLS), RE(CLS)), which reduces realism for span extraction tasks.
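To make the last limitation concrete, here is a minimal sketch of how span-level NER annotations might be recast as classification examples. The prompt wording, label set, and field names are assumptions for illustration and are not taken from the paper or the FinGPT repo.

```python
from typing import Dict, List, Tuple


def ner_to_classification(sentence: str,
                          entities: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (span, label) annotations into one classification example per span.

    The span is given to the model in the input, so the model only has to
    name its type; it never has to locate the span itself.
    """
    instruction = (
        "What is the entity type of the given span? "
        "Options: person, organization, location."  # ASSUMED label set
    )
    return [
        {
            "instruction": instruction,
            "input": f"Sentence: {sentence}\nSpan: {span}",
            "output": label,
        }
        for span, label in entities
    ]


# HYPOTHETICAL example sentence and annotations.
examples = ner_to_classification(
    "Acme Corp signed the agreement with Jane Doe in London.",
    [("Acme Corp", "organization"), ("Jane Doe", "person"), ("London", "location")],
)
for ex in examples:
    print(ex["input"].splitlines()[1], "->", ex["output"])
```

Because the span is supplied in the prompt, a good score on this formulation says little about whether the model could find entity boundaries on its own.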
When Not To Use
- For high-stakes finance tasks that need calibrated probabilities and strict factual recall.
- When you require span-level NER outputs (paper converts NER to classification).
- If you must support neutral sentiment in zero-shot scenarios (neutral excluded).
Failure Modes
- Hallucination on unseen tasks or when instruction diversity is low.
- Task interference in multi-task settings that can hurt classification accuracy.
- Bias from small or unbalanced financial datasets causing overfitting.
Core Entities
Models
- Llama2-7B
- Falcon-7B
- MPT-7B
- BLOOM-7.1B
- ChatGLM2-6B
- Qwen-7B
Metrics
- F1-score
- entity-level F1
- relation-only F1
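As a reminder of how a set-based F1 of this kind is usually computed, the sketch below scores predicted (span, type) pairs against gold pairs. It shows the generic precision/recall/F1 definition, not necessarily the paper's exact scoring script, and the example data is made up.

```python
from typing import Set, Tuple


def set_f1(pred: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]) -> float:
    """F1 over exact matches between predicted and gold items."""
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# HYPOTHETICAL predictions: one exact match, one span-boundary error.
gold = {("Acme Corp", "organization"), ("Jane Doe", "person")}
pred = {("Acme Corp", "organization"), ("Doe", "person")}

score = set_f1(pred, gold)
print(score)
```

The same function works for relation scoring if each item is a relation label or a (head, relation, tail) triple instead of a (span, type) pair.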
Datasets
- FPB
- FiQA-SA
- TFNS
- NWGI
- Headline
- NER (domain)
- FinRED
- FLUE
Benchmarks
- FinGPT Benchmark (this paper)
- FLUE

