PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

June 8, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

FinMA and FIT provide practical, open resources for finance tasks and improve NLP results, but numeric QA and prediction remain weak; further task-specific engineering is required before production trading use.

Citations: 43

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 11/11

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, Jimin Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Who Should Care

Summary TLDR

PIXIU bundles three open resources for finance NLP and prediction: FIT (136K multi-task instruction samples), FLARE (a 5-task financial benchmark), and FinMA (an instruction-tuned, LLaMA-based financial LLM). FinMA (7B/30B) outperforms general LLMs on several financial NLP tasks but performs poorly on numeric-heavy QA and shows weak, inconsistent results on stock-movement prediction. The code, model weights, data, and benchmark are all released to speed practical work on finance-specific LLMs.

Problem Statement

Before PIXIU, finance had no open instruction-tuned large language model, no large instruction-tuning dataset, and no holistic benchmark covering both financial language tasks and stock-movement prediction. This gap slows reproducible progress on finance-focused LLMs.

Main Contribution

FIT: a 136,609-sample, multi-task, multi-modal instruction tuning dataset for finance.

FLARE: an evaluation benchmark covering 4 NLP tasks (6 datasets) and 1 prediction task (3 datasets).

Key Findings

The authors built FIT with 136,609 instruction-tuning examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

Practical Use: You can fine-tune LLaMA-style models for many finance tasks without creating prompts from scratch.

Evidence Ref: Abstract, Sec. 3, Table 2
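
As an illustration of what "without creating prompts from scratch" means in practice, here is a minimal sketch of turning one labeled sentiment sentence into an instruction-tuning record. The template wording, the field names (`instruction`, `input`, `output`), and the helpers `build_example` and `to_prompt` are assumptions made for this example; they are not the exact FIT schema.

```python
import json

# Hypothetical prompt template; the real FIT prompts may be worded differently.
SENTIMENT_INSTRUCTION = (
    "Analyze the sentiment of this statement extracted from a financial news "
    "article. Answer with one of: negative, neutral, positive."
)
ALLOWED_LABELS = ("negative", "neutral", "positive")


def build_example(text: str, label: str) -> dict:
    """Turn one labeled sentence into an instruction/input/output record."""
    label = label.strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return {
        "instruction": SENTIMENT_INSTRUCTION,
        "input": text.strip(),
        "output": label,
    }


def to_prompt(example: dict) -> str:
    """Render the record as the text shown to the model (answer left open)."""
    return (
        f"{example['instruction']}\n"
        f"Text: {example['input']}\n"
        f"Answer:"
    )


if __name__ == "__main__":
    # Toy sentence invented for illustration.
    record = build_example("Operating profit rose compared with last year.", "Positive")
    print(json.dumps(record, indent=2))
    print(to_prompt(record))
```

Keeping the record and the rendered prompt separate makes it easy to reuse the same labeled data with a different template later.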

FinMA outperforms general LLMs on several financial NLP tasks after instruction tuning.

Numbers: FPB F1: FinMA-30B 0.88 vs GPT-4 0.78 (+0.10 F1)

Practical Use: If you need accurate financial sentiment or headline classification, use an instruction-tuned domain model rather than a general LLM.

Evidence Ref: Table 5; Sec. 6.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | FinMA-30B 0.87 | ChatGPT 0.78 | +0.09 vs ChatGPT | FPB test | Table 5 shows FinMA-30B Acc 0.87 and ChatGPT Acc 0.78 | Table 5 |
| FPB F1 | FinMA-30B 0.88 | GPT-4 0.78 | +0.10 vs GPT-4 | FPB test | Table 5 reports F1 scores | Table 5 |
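
The deltas above are simple differences between a model score and a baseline score on the same split. The sketch below shows one way to compute accuracy, F1, and such a delta on your own held-out labels. It uses macro-averaged F1 as an assumption; this summary does not state which F1 averaging the paper uses, and the toy labels are invented for illustration.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging, see lead-in)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # missed the true class g
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def delta(model_score, baseline_score, digits=2):
    """Score difference, rounded the way the table reports it."""
    return round(model_score - baseline_score, digits)


if __name__ == "__main__":
    # Toy held-out labels and predictions, invented for illustration only.
    gold = ["positive", "negative", "neutral", "positive"]
    tuned = ["positive", "negative", "neutral", "positive"]
    general = ["positive", "negative", "positive", "positive"]
    for name, pred in (("tuned", tuned), ("general", general)):
        print(name, accuracy(gold, pred), round(macro_f1(gold, pred), 3))
    # Deltas from the reported FPB numbers: accuracy vs ChatGPT, F1 vs GPT-4.
    print(delta(0.87, 0.78), delta(0.88, 0.78))
```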

What To Try In 7 Days

Fine-tune a small LLaMA checkpoint with FIT on your firm’s sentiment or headline data and compare to off-the-shelf ChatGPT on held-out labels.

Run FLARE evaluation on your current models to get reproducible task-by-task gaps (sentiment, NER, QA, prediction).

If you need numeric QA, add targeted numeric reasoning data or integrate symbolic calculators rather than relying on base FinMA.
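
One way to "integrate a symbolic calculator" is to have the model emit an arithmetic expression and compute the result outside the model. The sketch below is a minimal example of that idea, assuming the model's output has already been reduced to a plain expression string; the `evaluate` helper is hypothetical and is not part of PIXIU or FinMA. It walks the parsed syntax tree and allows only numbers and basic operators, so it never calls `eval()` on model output.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _eval_node(node):
    """Recursively evaluate a whitelisted arithmetic syntax tree."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return float(_eval_node(tree.body))


if __name__ == "__main__":
    # Toy percent-change question: revenue went from 100 to 120.
    print(evaluate("(120 - 100) / 100"))
```

The design choice here is to let the language model handle reading the report and choosing the operands, while the arithmetic itself is done deterministically.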

Optimization Features

Infra Optimization
FinMA-7B trained on 8x A100-40GB; FinMA-30B on 128x A100-40GB
Training Optimization
Instruction tuning on 136K examples
Fine-tuned LLaMA checkpoints (7B, 30B)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

FinMA is released only up to 30B parameters; the 30B model was not fine-tuned on the full dataset because of compute limits.

Backbone (LLaMA) shows poor numeric and mathematical reasoning, hurting financial QA.

When Not To Use

For high-stakes quantitative financial question answering without additional numeric modules.

As a sole trading signal generator; stock prediction performance is weak and inconsistent.

Failure Modes

Hallucinated or incorrect numeric answers in financial QA.

Near-zero predictive correlation on some stock datasets leading to useless trading signals.
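
To make "near-zero predictive correlation" concrete, here is a minimal sketch of the Matthews Correlation Coefficient (MCC) for binary up/down labels. The toy label sequences are invented for illustration, and returning 0.0 when the denominator is zero is a common convention assumed here. It shows that a predictor that always says "up", or one that is right half the time in each class, scores an MCC of 0 even though its accuracy is 50%.

```python
import math


def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels (1 = up, 0 = down)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an empty row or column in the confusion
        # matrix means no measurable correlation.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy price-movement labels, invented for illustration only.
    gold = [1, 0, 1, 0, 1, 0, 1, 0]
    always_up = [1] * len(gold)
    coin_flip_like = [1, 0, 0, 1, 1, 0, 0, 1]
    for name, pred in (("perfect", gold),
                       ("always_up", always_up),
                       ("coin_flip_like", coin_flip_like)):
        print(name, mcc(gold, pred))
```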

Core Entities

Models

FinMA-7B, FinMA-30B, FinMA-7B-full, LLaMA, BloombergGPT, GPT-4, ChatGPT, BLOOM, GPT-NeoX, OPT-66B, Vicuna-13B

Metrics

Accuracy, F1, Avg F1, Entity F1, Exact Match (EM), Matthews Correlation Coefficient (MCC)

Datasets

FIT, FLARE, FPB, FiQA-SA, Headline, FIN (NER), FinQA, ConvFinQA, BigData22, ACL18, CIKM18

Benchmarks

FLARE, FLUE