Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Spanish-language finance is a large and growing market. A small bilingual model tuned for the domain can beat general-purpose state-of-the-art models on Spanish finance tasks, which enables better local analytics and customer support at lower compute cost.
Summary TLDR
The authors release Toisón de Oro, a bilingual financial stack with three parts: FIT-ES (an instruction-tuning dataset of ≈151k samples), FinMA-ES (a finetuned LLaMA2-7B model), and FLARE-ES (a bilingual evaluation suite of 21 datasets across 11 tasks). On FLARE-ES, FinMA-ES substantially improves Spanish financial task performance over general-purpose state-of-the-art models, and it outperforms GPT-4 on several Spanish datasets. The paper shows that bilingual instruction tuning yields cross-lingual gains, but it also notes that the 7B model size limits capability and that summarization results are weak.
Problem Statement
Financial NLP has been dominated by English. Spanish finance data and tools are scarce, so off-the-shelf LLMs underperform on Spanish tasks. The paper builds bilingual training data, a finetuned Spanish–English financial LLM, and a bilingual benchmark to measure and reduce that gap.
Main Contribution
FIT-ES: a bilingual financial instruction-tuning dataset (reported ≈151k samples across 15 sources and 7 tasks).
FinMA-ES: finetuned LLaMA2-7B models (bilingual and Spanish-only) for financial tasks.
FLARE-ES: an open bilingual benchmark (21 datasets, 11 tasks) for cross-lingual financial evaluation.
Evidence that bilingual instruction tuning boosts Spanish performance and transfers benefits back to English.
Open release of datasets, models, and evaluation code to enable follow-up work.
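As an illustration of what an instruction-tuning sample can look like, the following is a minimal sketch that wraps labelled Spanish sentences as instruction/input/output records and serializes them as JSON Lines. The field names, the prompt wording, and the example sentences are all assumptions made for illustration; this summary does not reproduce the actual FIT-ES schema or prompts.

```python
import json

# Hypothetical prompt: the real FIT-ES instructions are not shown in this summary.
SENTIMENT_PROMPT_ES = (
    "Analiza el sentimiento de la siguiente noticia financiera. "
    "Responde con: positivo, negativo o neutral."
)


def to_instruction_record(text, label, prompt=SENTIMENT_PROMPT_ES):
    """Wrap one labelled example as an instruction/input/output record.

    The three field names are an assumed, commonly used layout.
    """
    return {"instruction": prompt, "input": text, "output": label}


def to_jsonl(records):
    """Serialize records as one JSON object per line, keeping accents readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Made-up examples, not taken from any dataset named in this document.
examples = [
    ("El banco elevó su beneficio un 12% en el trimestre.", "positivo"),
    ("La empresa recortó su previsión de ingresos.", "negativo"),
]
records = [to_instruction_record(text, label) for text, label in examples]
jsonl = to_jsonl(records)
print(jsonl)
```

Keeping `ensure_ascii=False` leaves Spanish accents unescaped, which makes the file easier to inspect by hand.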
Key Findings
The authors assembled FIT-ES, a bilingual financial instruction dataset of ≈151k samples.
FinMA-ES (7B) outperforms GPT-4 on multiple Spanish financial tasks in FLARE-ES.
Bilingual tuning improves cross-lingual performance versus Spanish-only tuning.
Large LLMs show a clear Spanish vs English performance gap on finance tasks.
Results
Metrics reported: Accuracy (4 results), ROUGE-L (1 result)
Who Should Care
What To Try In 7 Days
Run FinMA-ES on a small Spanish finance dataset to compare end-to-end accuracy versus your current LLM.
Add a few thousand domain-specific Spanish instruction examples and retune an existing 7B model to test quick gains.
Adopt FLARE-ES or FIT-ES subsets to benchmark multilingual performance before production rollout.
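The comparison suggested in the steps above can be sketched as a small accuracy harness. This is a minimal sketch, assuming you have already collected each model's text outputs for the same labelled examples; the model names, labels, and outputs below are made-up placeholders, and the case-insensitive exact-match normalization is an assumption, not the FLARE-ES scoring procedure.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference label.

    Matching ignores case and surrounding whitespace (an assumed normalization).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def compare_models(outputs_by_model, references):
    """Return {model_name: accuracy} for every model on the same references."""
    return {
        name: accuracy(preds, references)
        for name, preds in outputs_by_model.items()
    }


# Placeholder data: four labelled examples and two hypothetical models.
references = ["positivo", "negativo", "neutral", "positivo"]
outputs = {
    "baseline": ["positivo", "neutral", "neutral", "negativo"],
    "candidate": ["Positivo", "negativo", "neutral", "negativo "],
}
scores = compare_models(outputs, references)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Scoring both models on an identical example list keeps the comparison paired, so a difference in accuracy is not an artifact of different test items.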
Optimization Features
Infra Optimization
- trained on 2x NVIDIA HGX A100 80GB GPUs
Training Optimization
- instruction tuning
- AdamW optimizer (learning rate 3e-4, 5 epochs, batch size 1)
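To get a feel for what these settings imply, here is a rough sketch that stores the reported hyperparameters and estimates the number of optimizer steps over ≈151k samples. Whether the batch size of 1 is per device, and whether gradient accumulation was used, is not stated in this summary, so `grad_accum_steps` and `num_devices` are hypothetical knobs with assumed defaults.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningConfig:
    learning_rate: float = 3e-4   # reported
    epochs: int = 5               # reported
    batch_size: int = 1           # reported
    grad_accum_steps: int = 1     # assumption: not stated in the summary
    num_devices: int = 1          # assumption: data-parallel replicas

    def optimizer_steps(self, num_samples):
        """Estimate total optimizer updates across all epochs."""
        samples_per_step = (
            self.batch_size * self.grad_accum_steps * self.num_devices
        )
        return self.epochs * math.ceil(num_samples / samples_per_step)


cfg = TuningConfig()
steps = cfg.optimizer_steps(151_000)  # 151k is the approximate FIT-ES size
print(steps)
```

Under the default assumptions this gives 755,000 updates; treating the batch as per-device across two data-parallel replicas would halve that estimate.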
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Model capped at 7B parameters due to compute limits, which may limit absolute capability.
- Summarization tasks (FNS-2023 and others) show weak performance across most models.
- Evaluation focuses on Spanish and English; other languages not covered.
- Potential for incorrect financial outputs; recommended for research use and human oversight.
When Not To Use
- High-stakes automated trading decisions without human review.
- As a drop-in replacement for large proprietary models on long-form summarization.
- For languages beyond Spanish and English without retraining.
Failure Modes
- Hallucinated or incorrect financial facts leading to wrong advice.
- Poor label-sequence generation on complex summary or label tasks (ECTSum, FiNER-ORD).
- Performance drops on long-document summarization and some leave-out datasets.
- Bias toward patterns in training data; language coverage gaps.
Core Entities
Models
- FinMA-ES-Bilingual
- FinMA-ES-Spanish
- FinMA-7B-full
- FinMA-30B-nlp
- LLaMA2-7B
- LLaMA2-13B
- GPT-4
- ChatGPT
- Lince-zero
- Falcon-7B
- Bloomz-7B1-mt
Metrics
- Accuracy
- F1
- Exact Match (EM)
- ROUGE (1/2/L)
- BERTScore
- BARTScore
- Matthews Correlation Coefficient (MCC)
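Two of the metrics listed above are less familiar than accuracy, so the following sketch shows one way to compute them by hand: MCC for binary labels, and a ROUGE-L F1 based on the longest common subsequence of whitespace tokens. This is a simplified illustration, not the benchmark's scoring code; published ROUGE implementations may differ in tokenization, stemming, and F-measure weighting.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels encoded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F1 over whitespace tokens (simplified)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up examples for illustration.
mcc_score = mcc([1, 1, 0, 0], [1, 0, 0, 0])
rouge_score = rouge_l_f1(
    "los ingresos subieron diez por ciento",
    "los ingresos subieron un diez por ciento",
)
print(round(mcc_score, 4), round(rouge_score, 4))
```

MCC uses all four confusion-matrix cells, which makes it more informative than accuracy on imbalanced label sets.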
Datasets
- FIT-ES
- FLARE-ES
- MultiFin
- FNS-2023
- TSA
- FinanceES
- EFP
- EFPA
- FinQA
- ConvFinQA
- FPB
- FiQA-SA
- Headlines
- FIN (NER)
- BigData22
- ACL18
- CIKM18
- FiNER-ORD
- ECTSum
- EDTSum
- GermanCredit
- AustralianCredit
- FOMC
Benchmarks
- FLARE-ES

