First open bilingual Spanish–English financial LLM, instruction data, and benchmark

February 12, 2024, 7 min read

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Xiao Zhang, Ruoyu Xiang, Chenhan Yuan, Duanyu Feng, Weiguang Han, Alejandro Lopez-Lira, Xiao-Yang Liu, Sophia Ananiadou, Min Peng, Jimin Huang, Qianqian Xie

Links

Abstract / PDF

Why It Matters For Business

Spanish is a large and growing financial-language market. A 7B bilingual model tuned on financial instructions can beat general-purpose state-of-the-art models, including GPT-4, on several Spanish finance tasks, enabling better local analytics and customer support at lower compute cost.

Summary TLDR

The authors release Toisón de Oro, a bilingual financial stack with three parts: FIT-ES (instruction data, ≈151k samples), FinMA-ES (a finetuned LLaMA2-7B model), and FLARE-ES (a bilingual evaluation suite of 21 datasets across 11 tasks). On FLARE-ES, FinMA-ES substantially outperforms general-purpose state-of-the-art models on Spanish financial tasks, including GPT-4 on several Spanish datasets. The paper shows that bilingual instruction tuning yields cross-lingual gains, but it also notes that model size is a limit and that summarization results are weak.

Problem Statement

Financial NLP has been dominated by English. Spanish finance data and tools are scarce, so off-the-shelf LLMs underperform on Spanish tasks. The paper builds bilingual training data, a finetuned Spanish–English financial LLM, and a bilingual benchmark to measure and reduce that gap.

Main Contribution

FIT-ES: a bilingual financial instruction-tuning dataset (reported ≈151k samples across 15 sources and 7 tasks).

FinMA-ES: finetuned LLaMA2-7B models (bilingual and Spanish-only) for financial tasks.

FLARE-ES: an open bilingual benchmark (21 datasets, 11 tasks) for cross-lingual financial evaluation.

Evidence that bilingual instruction tuning boosts Spanish performance and transfers benefits back to English.

Open release of datasets, models, and evaluation code to enable follow-up work.

Key Findings

Authors assembled a bilingual instruction dataset for finance.

Numbers: ≈151k instruction samples from 15 datasets

FinMA-ES (7B) outperforms GPT-4 on multiple Spanish financial tasks in FLARE-ES.

Numbers: FinMA-ES accuracy 0.99 on MultiFin vs. GPT-4 0.60; FinMA-ES leads on 4 of 6 Spanish datasets

Bilingual tuning improves cross-lingual performance versus Spanish-only tuning.

Numbers: Bilingual > Spanish-only on 3/6 Spanish datasets and 6/9 English datasets

Large LLMs show a clear Spanish vs English performance gap on finance tasks.

Numbers: Many baselines (GPT-4, ChatGPT) have much lower Spanish accuracy than FinMA-ES (examples in Table 3)

Results

Accuracy

Value: 0.99 (FinMA-ES-Bilingual)

Baseline: GPT-4 0.60

Accuracy

Value: 0.84 (FinMA-ES-Bilingual)

Baseline: GPT-4 0.27

Accuracy

Value: 0.99 (FinMA-ES-Bilingual)

Baseline: GPT-4 0.34

Accuracy

Value: 0.85 (FinMA-ES-Bilingual)

Baseline: GPT-4 0.47

ROUGE-L

Value: 0.13 (GPT-4 best)

Baseline: FinMA-ES 0.00
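
To read the four accuracy rows above as absolute gaps, a few lines of arithmetic are enough. This is a minimal sketch: the summary does not name the dataset behind each row, so the pairs are simply taken in the order listed, and the function names are illustrative rather than anything from the paper's code.

```python
from typing import List, Tuple

# (FinMA-ES-Bilingual accuracy, GPT-4 accuracy), in the order the rows appear.
# The summary does not name the dataset for each row, so rows are not labeled.
ACCURACY_PAIRS: List[Tuple[float, float]] = [
    (0.99, 0.60),
    (0.84, 0.27),
    (0.99, 0.34),
    (0.85, 0.47),
]


def accuracy_gaps(pairs: List[Tuple[float, float]]) -> List[float]:
    """Absolute accuracy gap (tuned model minus baseline) for each row."""
    return [round(tuned - baseline, 2) for tuned, baseline in pairs]


def mean_gap(pairs: List[Tuple[float, float]]) -> float:
    """Unweighted mean of the absolute gaps."""
    gaps = [tuned - baseline for tuned, baseline in pairs]
    return sum(gaps) / len(gaps)


gaps = accuracy_gaps(ACCURACY_PAIRS)
average = mean_gap(ACCURACY_PAIRS)
print(gaps, round(average, 4))
```

The ROUGE-L row is left out on purpose: it is a different metric, and there the baseline is ahead.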

Who Should Care

What To Try In 7 Days

Run FinMA-ES on a small Spanish finance dataset to compare end-to-end accuracy versus your current LLM.

Add a few thousand domain-specific Spanish instruction examples and retune an existing 7B model to test quick gains.

Adopt FLARE-ES or FIT-ES subsets to benchmark multilingual performance before production rollout.
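
The first suggestion above amounts to scoring two models on the same labeled examples. A minimal sketch of that comparison follows, assuming each model can be wrapped as a function from text to a label string. The two "models" here are hypothetical keyword stand-ins, not FinMA-ES or any real API, and the three Spanish sentences are invented for illustration only.

```python
from typing import Callable, Dict, List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def compare_models(
    examples: List[str],
    gold: List[str],
    models: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Run every model on every example and report accuracy per model."""
    return {
        name: accuracy([predict(text) for text in examples], gold)
        for name, predict in models.items()
    }


# Hypothetical stand-ins: replace these with calls to the models under test.
def candidate_model(text: str) -> str:
    return "positive" if "sube" in text else "negative"


def current_model(text: str) -> str:
    return "positive"  # always predicts the same label


# Invented toy examples, not drawn from FLARE-ES.
examples = [
    "La acción sube tras los resultados.",
    "El banco recorta sus previsiones.",
    "El índice sube un 2%.",
]
gold = ["positive", "negative", "positive"]

scores = compare_models(
    examples, gold, {"candidate": candidate_model, "current": current_model}
)
print(scores)
```

Keeping the models behind a plain callable makes it easy to swap in a local checkpoint or a hosted API without touching the scoring code.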

Optimization Features

Infra Optimization

  • trained on 2x NVIDIA HGX A100 80GB GPUs

Training Optimization

  • instruction tuning
  • AdamW optimizer (learning rate 3e-4, 5 epochs, batch size 1)
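
For readers unfamiliar with AdamW, the update it applies on each step can be sketched in plain Python. Only the learning rate (3e-4) comes from the summary above. The betas, epsilon, and weight decay below are assumptions (they match PyTorch's `torch.optim.AdamW` defaults), and this is an illustration of the update rule, not the authors' training code.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AdamWConfig:
    lr: float = 3e-4            # learning rate reported in the summary
    beta1: float = 0.9          # assumption: not stated in the summary
    beta2: float = 0.999        # assumption: not stated in the summary
    eps: float = 1e-8           # assumption: not stated in the summary
    weight_decay: float = 0.01  # assumption: not stated in the summary


def adamw_step(
    params: List[float],
    grads: List[float],
    m: List[float],
    v: List[float],
    step: int,
    cfg: AdamWConfig,
) -> Tuple[List[float], List[float], List[float]]:
    """One AdamW update; `step` counts from 1. Returns (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = cfg.beta1 * m_i + (1 - cfg.beta1) * g
        v_i = cfg.beta2 * v_i + (1 - cfg.beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1 - cfg.beta1 ** step)
        v_hat = v_i / (1 - cfg.beta2 ** step)
        # Decoupled weight decay, applied directly to the parameter.
        p = p - cfg.lr * cfg.weight_decay * p
        # Adam update.
        p = p - cfg.lr * m_hat / (math.sqrt(v_hat) + cfg.eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


params, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], step=1, cfg=AdamWConfig())
print(params, m, v)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of gradient scale.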

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Model capped at 7B parameters due to compute limits, which may limit absolute capability.
  • Summarization tasks (FNS-2023 and others) show weak performance across most models.
  • Evaluation focuses on Spanish and English; other languages not covered.
  • Potential for incorrect financial outputs; recommended for research use under human oversight.

When Not To Use

  • High-stakes automated trading decisions without human review.
  • As a drop-in replacement for large proprietary models on long-form summarization.
  • For languages beyond Spanish and English without retraining.

Failure Modes

  • Hallucinated or incorrect financial facts leading to wrong advice.
  • Poor label-sequence generation on complex summary or label tasks (ECTSum, FiNER-ORD).
  • Performance drops on long-document summarization and some leave-out datasets.
  • Bias toward patterns in training data; language coverage gaps.

Core Entities

Models

  • FinMA-ES-Bilingual
  • FinMA-ES-Spanish
  • FinMA-7B-full
  • FinMA-30B-nlp
  • LLaMA2-7B
  • LLaMA2-13B
  • GPT-4
  • ChatGPT
  • Lince-zero
  • Falcon-7B
  • Bloomz-7B1-mt

Metrics

  • Accuracy
  • F1
  • Exact Match (EM)
  • ROUGE (1/2/L)
  • BERTScore
  • BARTScore
  • Matthews Correlation Coefficient (MCC)
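
Two of the classification metrics listed above, F1 and MCC, can be computed directly from confusion counts. The sketch below assumes a binary task with one designated positive label. The benchmark may use macro or weighted variants on multi-class datasets, so treat this as an illustration of the definitions rather than the paper's exact scoring. Function names and the toy labels are illustrative.

```python
import math
from typing import List, Tuple


def confusion_counts(
    gold: List[int], pred: List[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for a binary task."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels for illustration only.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(gold, pred)
f1 = f1_score(tp, fp, fn)
corr = mcc(tp, fp, fn, tn)
print(tp, fp, fn, tn, f1, corr)
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy when class frequencies are imbalanced.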

Datasets

  • FIT-ES
  • FLARE-ES
  • MultiFin
  • FNS-2023
  • TSA
  • FinanceES
  • EFP
  • EFPA
  • FinQA
  • ConvFinQA
  • FPB
  • FiQA-SA
  • Headlines
  • FIN (NER)
  • BigData22
  • ACL18
  • CIKM18
  • FiNER-ORD
  • ECTSum
  • EDTSum
  • GermanCredit
  • AustralianCredit
  • FOMC

Benchmarks

  • FLARE-ES