First open bilingual Spanish–English financial LLM, instruction data, and benchmark

Overview

Decision Snapshot: Needs Validation

The work provides open datasets and a tuned 7B model with clear Spanish gains. Production readiness is moderate due to model size limits, weak summarization, and ethical risks in financial outputs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Xiao Zhang, Ruoyu Xiang, Chenhan Yuan, Duanyu Feng, Weiguang Han, Alejandro Lopez-Lira, Xiao-Yang Liu, Sophia Ananiadou, Min Peng, Jimin Huang, Qianqian Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Spanish is a large and growing financial-language market; a small, tuned bilingual model can beat generic SOTA on Spanish finance tasks, enabling better local analytics and customer support at lower compute cost.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors release Toisón de Oro, a bilingual financial stack with three parts: FIT-ES, an instruction dataset of ≈151k samples; FinMA-ES, a finetuned LLaMA2-7B model; and FLARE-ES, a bilingual evaluation suite of 21 datasets across 11 tasks. On FLARE-ES, FinMA-ES substantially improves Spanish financial task performance over general SOTA models, surpassing GPT-4 on several Spanish datasets. The paper shows that bilingual instruction tuning yields cross-lingual gains, but it also notes model size limits and weak summarization results.

Problem Statement

Financial NLP has been dominated by English. Spanish finance data and tools are scarce, so off-the-shelf LLMs underperform on Spanish tasks. The paper builds bilingual training data, a finetuned Spanish–English financial LLM, and a bilingual benchmark to measure and reduce that gap.

Main Contribution

FIT-ES: a bilingual financial instruction-tuning dataset (reported ≈151k samples across 15 sources and 7 tasks).

FinMA-ES: finetuned LLaMA2-7B models (bilingual and Spanish-only) for financial tasks.

Key Findings

Authors assembled a bilingual instruction dataset for finance.

Numbers: ≈151k instruction samples from 15 datasets

Practical Use: You can reuse FIT-ES to instruction-tune existing LLMs for Spanish and English financial tasks.

Evidence Ref: Abstract; Sec. 3.1; Table 1

FinMA-ES (7B) outperforms GPT-4 on multiple Spanish financial tasks in FLARE-ES.

Numbers: FinMA-ES accuracy 0.99 on MultiFin vs. GPT-4 0.60; ahead of GPT-4 on 4 of 6 Spanish datasets

Practical Use: Finetune on a domain- and language-specific instruction set to beat larger generic models on niche non-English finance tasks.

Evidence Ref: Table 3; Sec. 4.1.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.99 (FinMA-ES-Bilingual)	GPT-4 0.60	+0.39	MultiFin (Spanish classification)	Table 3, MultiFin Acc	Table 3
Accuracy	0.84 (FinMA-ES-Bilingual)	GPT-4 0.27	+0.57	EFP (Spanish QA)	Table 3, EFP Acc	Table 3
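
The Delta column is the absolute accuracy difference between FinMA-ES-Bilingual and GPT-4. As a quick check, a minimal sketch that recomputes it from the two rows above (the dictionary keys and the function name are illustrative, not taken from the paper or its code):

```python
# Accuracy values copied from the Results table above (Table 3 in the paper).
results = [
    {"dataset": "MultiFin", "finma_es": 0.99, "gpt4": 0.60},
    {"dataset": "EFP", "finma_es": 0.84, "gpt4": 0.27},
]


def accuracy_delta(row):
    """Absolute accuracy gain of FinMA-ES-Bilingual over GPT-4, rounded to 2 decimals."""
    return round(row["finma_es"] - row["gpt4"], 2)


deltas = {row["dataset"]: accuracy_delta(row) for row in results}

for dataset, delta in deltas.items():
    print(f"{dataset}: {delta:+.2f}")
```

Both recomputed values agree with the reported deltas of +0.39 and +0.57.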

What To Try In 7 Days

Run FinMA-ES on a small Spanish finance dataset to compare end-to-end accuracy versus your current LLM.

Add a few thousand domain-specific Spanish instruction examples and retune an existing 7B model to test quick gains.

Adopt FLARE-ES or FIT-ES subsets to benchmark multilingual performance before production rollout.
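
For the first item, a side-by-side accuracy comparison can be very small. The sketch below assumes each model is wrapped as a plain Python function that maps input text to a label string; `candidate_model`, `baseline_model`, and the four Spanish sentences are hypothetical stand-ins for illustration only, not FinMA-ES outputs or FLARE-ES data.

```python
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def normalize(label: str) -> str:
    """Compare labels case-insensitively and ignore surrounding whitespace."""
    return label.strip().lower()


def accuracy(examples: Iterable[Example], predict: Callable[[str], str]) -> float:
    """Fraction of examples where the predicted label matches the gold label."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(normalize(predict(text)) == normalize(gold) for text, gold in examples)
    return hits / len(examples)


# Hypothetical toy data; replace with your own labeled Spanish finance examples.
toy_examples = [
    ("El banco central sube los tipos de interés.", "negative"),
    ("La empresa supera las previsiones de beneficios.", "positive"),
    ("Las acciones cierran sin cambios.", "neutral"),
    ("La compañía anuncia pérdidas récord.", "negative"),
]


def candidate_model(text: str) -> str:
    """Stand-in for a call to the model under test (hypothetical)."""
    return "positive" if "supera" in text else "negative"


def baseline_model(text: str) -> str:
    """Stand-in for a call to your current LLM (hypothetical)."""
    return "neutral"


candidate_acc = accuracy(toy_examples, candidate_model)
baseline_acc = accuracy(toy_examples, baseline_model)
print(f"candidate={candidate_acc:.2f} baseline={baseline_acc:.2f}")
```

Keeping the model behind a plain callable means the same harness can score a local checkpoint and a hosted API without changes.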

Optimization Features

Infra Optimization

trained on 2x NVIDIA HGX A100 80GB GPUs

Training Optimization

instruction tuning

AdamW optimizer settings (lr 3e-4, 5 epochs, batch 1)
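
To get a feel for what these settings imply, the reported values can be captured in a small config object and used to estimate the number of optimizer steps. This is a rough sketch under stated assumptions: the summary does not say whether "batch 1" is per device or global, how many devices share the work, or whether gradient accumulation was used, so those appear as explicit parameters. The class and function names are illustrative, and 151,000 stands in for the ≈151k sample count.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningConfig:
    # Values reported in the summary above.
    optimizer: str = "adamw"
    learning_rate: float = 3e-4
    epochs: int = 5
    batch_size: int = 1  # assumed here to be per device; the source does not say


def total_optimizer_steps(cfg: TuningConfig, num_samples: int,
                          num_devices: int, grad_accum: int = 1) -> int:
    """Estimate optimizer steps, assuming plain data parallelism."""
    effective_batch = cfg.batch_size * num_devices * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * cfg.epochs


cfg = TuningConfig()
# Assumption: "2x" is read as two data-parallel devices.
steps = total_optimizer_steps(cfg, num_samples=151_000, num_devices=2)
print(steps)
```

Under a different reading of the hardware line or batch setting, only the `num_devices` and `grad_accum` arguments need to change.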

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/chancefocus/PIXIU

Data URLs

https://github.com/chancefocus/PIXIU

Risks & Boundaries

Limitations

Model capped at 7B parameters due to compute limits, which may limit absolute capability.

Summarization tasks (FNS-2023 and others) show weak performance across most models.

When Not To Use

High-stakes automated trading decisions without human review.

As a drop-in replacement for large proprietary models on long-form summarization.

Failure Modes

Hallucinated or incorrect financial facts leading to wrong advice.

Poor label-sequence generation on complex summary or label tasks (ECTSum, FiNER-ORD).

Core Entities

Models

FinMA-ES-Bilingual, FinMA-ES-Spanish, FinMA-7B-full, FinMA-30B-nlp, LLaMA2-7B, LLaMA2-13B, GPT-4, ChatGPT, Lince-zero, Falcon-7B, Bloomz-7B1-mt

Metrics

Accuracy, F1, Exact Match (EM), ROUGE (1/2/L), BERTScore, BARTScore, Matthews Correlation Coefficient (MCC)
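
Of these, F1 and MCC are the least self-explanatory. The sketch below computes both for the binary case from confusion-matrix counts, using the standard definitions; the paper's exact averaging (for example, macro or weighted F1 on multi-class tasks) is not stated in this summary, so treat this only as an illustration of the metrics themselves.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Tiny hand-checkable example: TP=1, FP=0, FN=1, TN=2.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(round(f1_score(y_true, y_pred), 4), round(mcc(y_true, y_pred), 4))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, which makes it more informative on imbalanced label sets.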

Datasets

FIT-ES, FLARE-ES, MultiFin, FNS-2023, TSA, FinanceES, EFP, EFPA, FinQA, ConvFinQA, FPB, FiQA-SA, Headlines, FIN (NER), BigData22, ACL18, CIKM18, FiNER-ORD, ECTSum, EDTSum, GermanCredit, AustralianCredit, FOMC

Benchmarks

FLARE-ES

