First open bilingual Spanish–English financial LLM, instruction data, and benchmark

February 12, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The work provides open datasets and a tuned 7B model with clear Spanish gains. Production readiness is moderate due to model size limits, weak summarization, and ethical risks in financial outputs.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 60%

Authors

Xiao Zhang, Ruoyu Xiang, Chenhan Yuan, Duanyu Feng, Weiguang Han, Alejandro Lopez-Lira, Xiao-Yang Liu, Sophia Ananiadou, Min Peng, Jimin Huang, Qianqian Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Spanish is a large and growing financial-language market; a small, tuned bilingual model can beat generic SOTA on Spanish finance tasks, enabling better local analytics and customer support at lower compute cost.

Summary TLDR

The authors release Toisón de Oro, a bilingual financial stack with three parts: FIT-ES (instruction data, ≈151k samples), FinMA-ES (a finetuned LLaMA2-7B model), and FLARE-ES (a bilingual evaluation suite of 21 datasets across 11 tasks). On FLARE-ES, FinMA-ES substantially improves Spanish financial task performance over general-purpose SOTA models, including GPT-4 on 4 of 6 Spanish datasets. The paper shows that bilingual instruction tuning yields cross-lingual gains, but it also notes that the 7B model size and weak summarization results remain limitations.

Problem Statement

Financial NLP has been dominated by English. Spanish finance data and tools are scarce, so off-the-shelf LLMs underperform on Spanish tasks. The paper builds bilingual training data, a finetuned Spanish–English financial LLM, and a bilingual benchmark to measure and reduce that gap.

Main Contribution

FIT-ES: a bilingual financial instruction-tuning dataset (reported ≈151k samples across 15 sources and 7 tasks).

FinMA-ES: finetuned LLaMA2-7B models (bilingual and Spanish-only) for financial tasks.

Key Findings

Authors assembled a bilingual instruction dataset for finance.

Numbers: ≈151k instruction samples from 15 datasets

Practical Use: You can reuse FIT-ES to instruction-tune existing LLMs for Spanish and English financial tasks.

Evidence Ref: Abstract; Sec. 3.1; Table 1

FinMA-ES (7B) outperforms GPT-4 on multiple Spanish financial tasks in FLARE-ES.

Numbers: FinMA-ES accuracy 0.99 on MultiFin vs. GPT-4 0.60; FinMA-ES leads on 4 of 6 Spanish datasets

Practical Use: Finetune on a domain- and language-specific instruction set to beat larger generic models on niche non-English finance tasks.

Evidence Ref: Table 3; Sec. 4.1.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.99 (FinMA-ES-Bilingual) | GPT-4 0.60 | +0.39 | MultiFin (Spanish classification) | Table 3, MultiFin Acc | Table 3
Accuracy | 0.84 (FinMA-ES-Bilingual) | GPT-4 0.27 | +0.57 | EFP (Spanish QA) | Table 3, EFP Acc | Table 3
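
The Delta values in the Results rows are the FinMA-ES-Bilingual accuracy minus the GPT-4 baseline. A minimal sketch that recomputes them from the reported numbers (the variable and function names are illustrative, not from the paper):

```python
# Reported accuracies copied from the Results rows above (Table 3 of the paper).
results = [
    {"dataset": "MultiFin", "finma_es": 0.99, "gpt4": 0.60},
    {"dataset": "EFP", "finma_es": 0.84, "gpt4": 0.27},
]


def delta(row):
    """Absolute accuracy gain of FinMA-ES-Bilingual over GPT-4, rounded to 2 decimals."""
    return round(row["finma_es"] - row["gpt4"], 2)


deltas = {row["dataset"]: delta(row) for row in results}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

These are absolute differences in accuracy, not relative improvements.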

What To Try In 7 Days

Run FinMA-ES on a small Spanish finance dataset to compare end-to-end accuracy versus your current LLM.

Add a few thousand domain-specific Spanish instruction examples and retune an existing 7B model to test quick gains.

Adopt FLARE-ES or FIT-ES subsets to benchmark multilingual performance before production rollout.
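
The first suggestion above amounts to scoring two models on the same labeled examples. A minimal sketch of such a comparison follows; `candidate_model` and `current_model` are hypothetical stand-ins (simple rules, not real model calls), and the four Spanish sentences are made-up toy data, not FLARE-ES samples.

```python
from typing import Callable, Sequence


def accuracy(predict: Callable[[str], str],
             texts: Sequence[str],
             labels: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(texts) != len(labels) or not texts:
        raise ValueError("texts and labels must be non-empty and the same length")
    correct = sum(
        predict(t).strip().lower() == y.strip().lower()
        for t, y in zip(texts, labels)
    )
    return correct / len(texts)


# Hypothetical stand-ins: replace with real calls to FinMA-ES and your current LLM.
def candidate_model(text: str) -> str:
    return "positivo" if "sube" in text else "negativo"


def current_model(text: str) -> str:
    return "positivo"


# Toy, made-up examples for illustration only.
texts = [
    "La acción sube tras resultados",
    "El banco reporta pérdidas",
    "El índice sube un 2%",
    "La empresa recorta su previsión",
]
labels = ["positivo", "negativo", "positivo", "negativo"]

candidate_acc = accuracy(candidate_model, texts, labels)
current_acc = accuracy(current_model, texts, labels)
gain = candidate_acc - current_acc
print(f"candidate={candidate_acc:.2f} current={current_acc:.2f} gain={gain:+.2f}")
```

Keeping the model behind a plain `str -> str` callable lets the same harness score any model, whether it runs locally or behind an API.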

Optimization Features

Infra Optimization
trained on 2x NVIDIA HGX A100 80GB GPUs
Training Optimization
instruction tuning; AdamW optimizer settings (lr 3e-4, 5 epochs, batch 1)
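
The reported settings can be collected in a small config object for reuse in your own runs. This is a sketch under stated assumptions, not the authors' training script: the summary does not say whether batch 1 is per device or global, or whether gradient accumulation was used, so the step count is illustrative only. `TuningConfig`, `num_samples`, and `optimizer_steps` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningConfig:
    # Values reported in the summary above.
    optimizer: str = "AdamW"
    learning_rate: float = 3e-4
    epochs: int = 5
    batch_size: int = 1
    # Assumption: the approximate FIT-ES size (≈151k) is the training-set size.
    num_samples: int = 151_000

    def optimizer_steps(self, grad_accum: int = 1) -> int:
        """Total optimizer updates, assuming batch_size is the global batch size."""
        per_epoch = math.ceil(self.num_samples / (self.batch_size * grad_accum))
        return per_epoch * self.epochs


config = TuningConfig()
total_steps = config.optimizer_steps()
print(f"{config.optimizer}, lr={config.learning_rate}, steps={total_steps}")
```

Under those assumptions a full run is about 755,000 optimizer updates; any gradient accumulation or per-device batching would reduce that number proportionally.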

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

The model is capped at 7B parameters because of compute constraints, which may restrict its absolute capability.

Summarization tasks (FNS-2023 and others) show weak performance across most models.

When Not To Use

High-stakes automated trading decisions without human review.

As a drop-in replacement for large proprietary models on long-form summarization.

Failure Modes

Hallucinated or incorrect financial facts leading to wrong advice.

Poor label-sequence generation on complex summary or label tasks (ECTSum, FiNER-ORD).

Core Entities

Models

FinMA-ES-Bilingual, FinMA-ES-Spanish, FinMA-7B-full, FinMA-30B-nlp, LLaMA2-7B, LLaMA2-13B, GPT-4, ChatGPT, Lince-zero, Falcon-7B, Bloomz-7B1-mt

Metrics

Accuracy, F1, Exact Match (EM), ROUGE (1/2/L), BERTScore, BARTScore, Matthews Correlation Coefficient (MCC)

Datasets

FIT-ES, FLARE-ES, MultiFin, FNS-2023, TSA, FinanceES, EFP, EFPA, FinQA, ConvFinQA, FPB, FiQA-SA, Headlines, FIN (NER), BigData22, ACL18, CIKM18, FiNER-ORD, ECTSum, EDTSum, GermanCredit, AustralianCredit, FOMC

Benchmarks

FLARE-ES