PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

Overview

Decision Snapshot: Needs Validation

FinMA and FIT provide practical, open resources for finance tasks and improve NLP results, but numeric QA and prediction remain weak; further task-specific engineering is required before production trading use.

Citations: 43

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 11/11

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, Jimin Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

PIXIU bundles three open resources for finance NLP and prediction: FIT (136K multi-task instruction samples), FLARE (a 5-task financial benchmark), and FinMA (an instruction‑tuned, LLaMA-based financial LLM). FinMA (7B/30B) outperforms general LLMs on several financial NLP tasks, but it fails at numeric-heavy QA and shows weak, inconsistent performance on stock-movement prediction. The code, model weights, data, and benchmark are all released to speed practical work on finance-specific LLMs.

Problem Statement

There was no open, instruction‑tuned large language model, no large instruction dataset, and no holistic benchmark that includes both financial language tasks and stock movement prediction. This gap slows reproducible progress on finance-focused LLMs.

Main Contribution

FIT: a 136,609-sample, multi-task, multi-modal instruction tuning dataset for finance.

FLARE: an evaluation benchmark covering 4 NLP tasks (6 datasets) and 1 prediction task (3 datasets).
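
The task and dataset counts above can be laid out as a small registry. This is a minimal sketch: the dataset names come from the Datasets list in this summary, but the grouping of datasets under task names is an assumption inferred from those names, and `FLARE_TASKS` and `count_by_kind` are hypothetical identifiers, not part of the released code.

```python
# Hypothetical registry mapping FLARE tasks to datasets.
# The task grouping is an assumption; check the PIXIU repository for the real layout.
FLARE_TASKS = {
    "sentiment_analysis": {"kind": "nlp", "datasets": ["FPB", "FiQA-SA"]},
    "headline_classification": {"kind": "nlp", "datasets": ["Headline"]},
    "named_entity_recognition": {"kind": "nlp", "datasets": ["FIN"]},
    "question_answering": {"kind": "nlp", "datasets": ["FinQA", "ConvFinQA"]},
    "stock_movement_prediction": {
        "kind": "prediction",
        "datasets": ["BigData22", "ACL18", "CIKM18"],
    },
}


def count_by_kind(tasks, kind):
    """Return (number of tasks, number of datasets) for one kind of task."""
    selected = [task for task in tasks.values() if task["kind"] == kind]
    return len(selected), sum(len(task["datasets"]) for task in selected)


nlp_counts = count_by_kind(FLARE_TASKS, "nlp")
prediction_counts = count_by_kind(FLARE_TASKS, "prediction")
print("NLP tasks/datasets:", nlp_counts)
print("Prediction tasks/datasets:", prediction_counts)
```

Under this grouping the counts agree with the summary: 4 NLP tasks over 6 datasets and 1 prediction task over 3 datasets.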

Key Findings

The authors built FIT with 136,609 instruction‑tuning examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

Practical Use: You can fine-tune LLaMA-style models for many finance tasks without creating prompts from scratch.

Evidence Ref: Abstract, Sec.3, Table 2
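
To make the "without creating prompts from scratch" point concrete, here is a rough sketch of turning one instruction sample into a prompt/completion pair for fine-tuning. The field names (`instruction`, `input`, `output`), the prompt template, and the example sentence are all assumptions made for illustration; the actual FIT schema should be checked in the PIXIU repository.

```python
def build_prompt(sample):
    """Join the instruction and the input text into one prompt string."""
    return f"{sample['instruction']}\nText: {sample['input']}\nAnswer:"


def to_training_pair(sample):
    """Convert one instruction sample into a prompt/completion pair."""
    return {"prompt": build_prompt(sample), "completion": " " + sample["output"]}


# Made-up sample; not taken from FIT.
example = {
    "instruction": "Classify the sentiment of the sentence as positive, negative, or neutral.",
    "input": "Operating profit rose compared with the same period last year.",
    "output": "positive",
}

pair = to_training_pair(example)
print(pair["prompt"])
print(pair["completion"])
```

Keeping the template in one function makes it easier to apply the same wording at training and evaluation time.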

FinMA outperforms general LLMs on several financial NLP tasks after instruction tuning.

Numbers: FPB F1: FinMA-30B 0.88 vs GPT-4 0.78 (+0.10 F1)

Practical Use: If you need accurate financial sentiment or headline classification, use an instruction‑tuned domain model rather than a general LLM.

Evidence Ref: Table 5; Sec.6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	FinMA-30B 0.87	ChatGPT 0.78	+0.09 vs ChatGPT	FPB test	Table 5 shows FinMA-30B Acc 0.87 and ChatGPT Acc 0.78	Table 5
FPB F1	FinMA-30B 0.88	GPT-4 0.78	+0.10 vs GPT-4	FPB test	Table 5 reports F1 scores	Table 5
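
The Delta column is simply the model score minus the baseline score. A short sketch that recomputes it from the two rows above; the dictionary layout and the `delta` helper are illustrative choices, and the scores are the ones reported in the table.

```python
# Scores copied from the Results table (FPB test split).
rows = [
    {"metric": "FPB Accuracy", "model": 0.87, "baseline": 0.78, "baseline_name": "ChatGPT"},
    {"metric": "FPB F1", "model": 0.88, "baseline": 0.78, "baseline_name": "GPT-4"},
]


def delta(row):
    """Model score minus baseline score, rounded to two decimals."""
    return round(row["model"] - row["baseline"], 2)


deltas = [delta(row) for row in rows]
for row, d in zip(rows, deltas):
    print(f"{row['metric']}: {d:+.2f} vs {row['baseline_name']}")
```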

What To Try In 7 Days

Fine-tune a small LLaMA checkpoint with FIT on your firm’s sentiment or headline data and compare to off-the-shelf ChatGPT on held-out labels.

Run FLARE evaluation on your current models to get reproducible task-by-task gaps (sentiment, NER, QA, prediction).

If you need numeric QA, add targeted numeric reasoning data or integrate symbolic calculators rather than relying on base FinMA.
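
One way to read "integrate symbolic calculators" is to have the model emit an arithmetic expression and compute the result outside the model. The sketch below is an assumption about how that could look, not something described in the paper: `evaluate_expression` is a hypothetical helper that accepts only numeric literals and the four basic operators, and the figures are made up.

```python
import ast
import operator

# Only these binary operators are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_expression(expression):
    """Evaluate +, -, *, / arithmetic on numeric literals; reject anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


# Made-up figures: percentage change from 120.0 to 150.0.
growth = evaluate_expression("(150.0 - 120.0) / 120.0 * 100")
print(growth)
```

Parsing with `ast` instead of calling `eval` means names, function calls, and attribute access are rejected, which matters when the expression comes from model output.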

Optimization Features

Infra Optimization

FinMA-7B trained on 8x A100-40GB; FinMA-30B on 128x A100-40GB

Training Optimization

Instruction tuning on 136K examples

Fine-tuned LLaMA checkpoints (7B, 30B)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/chancefocus/PIXIU

Data URLs

https://github.com/chancefocus/PIXIU

Risks & Boundaries

Limitations

FinMA released only up to 30B; 30B was not fine-tuned on the full dataset due to compute limits.

Backbone (LLaMA) shows poor numeric and mathematical reasoning, hurting financial QA.

When Not To Use

For high-stakes quantitative financial question answering without additional numeric modules.

As a sole trading signal generator; stock prediction performance is weak and inconsistent.

Failure Modes

Hallucinated or incorrect numeric answers in financial QA.

Near-zero predictive correlation on some stock datasets leading to useless trading signals.
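
To see why near-zero correlation makes a signal useless, it helps to compute the Matthews Correlation Coefficient (MCC, listed under Metrics) on toy predictions. This sketch assumes MCC is the correlation measure meant here and uses made-up binary up/down labels; `mcc` is a hypothetical helper written for illustration.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = up, 0 = down)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Made-up movement labels for eight days.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
always_up = [1, 1, 1, 1, 1, 1, 1, 1]
coin_like = [1, 1, 0, 0, 1, 1, 0, 0]

mcc_always_up = mcc(y_true, always_up)
mcc_coin_like = mcc(y_true, coin_like)
mcc_perfect = mcc(y_true, y_true)
print(mcc_always_up, mcc_coin_like, mcc_perfect)
```

Both weak predictors are right on half of the days, yet their MCC is 0, which is why accuracy alone can hide a signal that carries no information.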

Core Entities

Models

FinMA-7B, FinMA-30B, FinMA-7B-full, LLaMA, BloombergGPT, GPT-4, ChatGPT, BLOOM, GPT-NeoX, OPT-66B, Vicuna-13B

Metrics

Accuracy, F1, Avg F1, Entity F1, Exact Match (EM), Matthews Correlation Coefficient (MCC)

Datasets

FIT, FLARE, FPB, FiQA-SA, Headline, FIN (NER), FinQA, ConvFinQA, BigData22, ACL18, CIKM18

Benchmarks

FLARE, FLUE
