Overview
FinMA and FIT provide practical, open resources for finance tasks and improve NLP results, but numeric QA and prediction remain weak; further task-specific engineering is required before production trading use.
Citations: 43
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 11/11
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.
Summary TLDR
PIXIU bundles three open resources for finance NLP and prediction: FIT (136K multi-task instruction samples), FLARE (a 5-task financial benchmark), and FinMA (an instruction-tuned, LLaMA-based financial LLM). FinMA (7B/30B) outperforms general LLMs on several financial NLP tasks but fails at numeric-heavy QA and shows weak, inconsistent performance on stock-movement prediction. All code, model weights, data, and the benchmark are released to speed practical work on finance-specific LLMs.
Problem Statement
Before PIXIU, finance had no open instruction-tuned large language model, no large instruction dataset, and no holistic benchmark covering both financial language tasks and stock movement prediction. This gap slows reproducible progress on finance-focused LLMs.
Main Contribution
FIT: a 136,609-sample, multi-task, multi-modal instruction tuning dataset for finance.
FLARE: an evaluation benchmark covering 4 NLP tasks (6 datasets) and 1 prediction task (3 datasets).
Key Findings
FIT contains 136,609 instruction-tuning examples spanning 5 tasks and 9 datasets.
FinMA outperforms general LLMs on several financial NLP tasks after instruction tuning.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | FinMA-30B 0.87 | ChatGPT 0.78 | +0.09 vs ChatGPT | FPB test | Table 5 shows FinMA-30B Acc 0.87 and ChatGPT Acc 0.78 | Table 5 |
| F1 | FinMA-30B 0.88 | GPT-4 0.78 | +0.10 vs GPT-4 | FPB test | Table 5 reports F1 scores | Table 5 |
What To Try In 7 Days
Fine-tune a small LLaMA checkpoint with FIT on your firm’s sentiment or headline data and compare to off-the-shelf ChatGPT on held-out labels.
Run FLARE evaluation on your current models to get reproducible task-by-task gaps (sentiment, NER, QA, prediction).
If you need numeric QA, add targeted numeric reasoning data or integrate symbolic calculators rather than relying on base FinMA.
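The held-out comparison suggested above can be sketched in a few lines. This is a minimal illustration, not the FLARE evaluation code: the function names, the toy labels, and the choice of macro-averaged F1 are all assumptions made for the example, and the source does not say which F1 variant it reports.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare(gold, candidate, baseline):
    """Return candidate-minus-baseline deltas for both metrics."""
    return {
        "accuracy_delta": accuracy(gold, candidate) - accuracy(gold, baseline),
        "macro_f1_delta": macro_f1(gold, candidate) - macro_f1(gold, baseline),
    }


# Toy held-out set (hypothetical labels, not taken from any real dataset).
gold = ["positive", "negative", "neutral", "neutral", "positive", "negative"]
tuned = ["positive", "negative", "neutral", "positive", "positive", "negative"]
baseline = ["positive", "neutral", "neutral", "neutral", "neutral", "negative"]

report = compare(gold, tuned, baseline)
print(report)
```

Reporting the delta rather than two separate scores mirrors the Delta column in the results table and makes it easy to see whether fine-tuning helped on your own labels.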
Risks & Boundaries
Limitations
FinMA is released only up to 30B parameters, and the 30B model was not fine-tuned on the full dataset due to compute limits.
Backbone (LLaMA) shows poor numeric and mathematical reasoning, hurting financial QA.
When Not To Use
For high-stakes quantitative financial question answering without additional numeric modules.
As a sole trading signal generator; stock prediction performance is weak and inconsistent.
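One form an additional numeric module could take is a small, restricted calculator that evaluates arithmetic the model writes out, instead of trusting the model's own computed number. The sketch below is an assumption about how such a module might look, not something the source describes: `evaluate` is a hypothetical helper, and it assumes the model can be prompted to emit a plain arithmetic expression.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


# Hypothetical model output for "percent change from 1100.0 to 1250.5".
percent_change = evaluate("(1250.5 - 1100.0) / 1100.0 * 100")
print(round(percent_change, 2))
```

Walking the parsed syntax tree and whitelisting node types keeps function calls, names, and attribute access out of the evaluation, which matters when the expression comes from model output.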
Failure Modes
Hallucinated or incorrect numeric answers in financial QA.
Near-zero predictive correlation on some stock datasets, which makes the resulting trading signals useless.
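To see why near-zero correlation is a failure even when accuracy looks tolerable, the following sketch computes a Matthews correlation coefficient (MCC) for binary up/down predictions. Using MCC here is an assumption: the summary does not name the correlation metric, and the labels below are toy values, not results from any stock dataset.

```python
import math


def mcc(gold, pred, positive=1):
    """Matthews correlation coefficient for binary labels.

    Returns 0.0 when any marginal count is zero, a common convention.
    """
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy movement labels: 1 = price rose, 0 = price fell (hypothetical data).
gold = [1, 0, 1, 0, 1, 0, 1, 0]
pred = [1, 1, 0, 0, 1, 1, 0, 0]

hit_rate = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(hit_rate, mcc(gold, pred))
```

In this toy case half the predictions are correct, yet the correlation is zero: the predictions carry no information about direction, which is the situation described in the failure mode above.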

