Overview
The work provides open datasets and a tuned 7B model with clear Spanish gains. Production readiness is moderate due to model size limits, weak summarization, and ethical risks in financial outputs.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Spanish is a large and growing financial-language market; a small, tuned bilingual model can beat generic SOTA on Spanish finance tasks, enabling better local analytics and customer support at lower compute cost.
Summary TLDR
The authors release Toisón de Oro, a bilingual financial stack with three parts: FIT-ES, an instruction dataset of ≈151k samples; FinMA-ES, a finetuned LLaMA2-7B model; and FLARE-ES, a bilingual evaluation suite of 21 datasets across 11 tasks. On the FLARE-ES benchmark, FinMA-ES substantially improves Spanish financial task performance over general-purpose SOTA models, and it outperforms GPT-4 on several Spanish datasets. The paper shows that bilingual instruction tuning yields cross-lingual gains, but it also notes that the model size is limited and that summarization results are weak.
Problem Statement
Financial NLP has been dominated by English. Spanish finance data and tools are scarce, so off-the-shelf LLMs underperform on Spanish tasks. The paper builds bilingual training data, a finetuned Spanish–English financial LLM, and a bilingual benchmark to measure and reduce that gap.
Main Contribution
FIT-ES: a bilingual financial instruction-tuning dataset (reported ≈151k samples across 15 sources and 7 tasks).
FinMA-ES: finetuned LLaMA2-7B models (bilingual and Spanish-only) for financial tasks.
Key Findings
The authors assembled FIT-ES, a bilingual instruction dataset for finance (reported ≈151k samples across 15 sources and 7 tasks).
FinMA-ES (7B) outperforms GPT-4 on multiple Spanish financial tasks in FLARE-ES.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.99 (FinMA-ES-Bilingual) | GPT-4 0.60 | +0.39 | MultiFin (Spanish classification) | Table 3, MultiFin Acc | Table 3 |
| Accuracy | 0.84 (FinMA-ES-Bilingual) | GPT-4 0.27 | +0.57 | EFP (Spanish QA) | Table 3, EFP Acc | Table 3 |
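The Delta column is the absolute accuracy difference between FinMA-ES-Bilingual and GPT-4. A minimal sketch that recomputes it from the two rows above; the dictionary keys and the helper name `accuracy_delta` are illustrative, not taken from the paper:

```python
# Accuracy values copied from the Results table above (reported from Table 3).
# Key names are illustrative assumptions, not identifiers from the paper.
results = [
    {"dataset": "MultiFin", "finma_es": 0.99, "gpt4": 0.60},
    {"dataset": "EFP", "finma_es": 0.84, "gpt4": 0.27},
]


def accuracy_delta(model_acc: float, baseline_acc: float) -> float:
    """Absolute accuracy difference, rounded to two decimals."""
    return round(model_acc - baseline_acc, 2)


deltas = {
    row["dataset"]: accuracy_delta(row["finma_es"], row["gpt4"])
    for row in results
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

These are absolute differences in accuracy (percentage points when multiplied by 100), not relative improvements.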
What To Try In 7 Days
Run FinMA-ES on a small Spanish finance dataset to compare end-to-end accuracy versus your current LLM.
Add a few thousand domain-specific Spanish instruction examples and retune an existing 7B model to test quick gains.
Adopt FLARE-ES or FIT-ES subsets to benchmark multilingual performance before production rollout.
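The first step above amounts to scoring two models on the same labeled Spanish examples. A minimal sketch of such a comparison harness follows, under stated assumptions: each model is wrapped as a plain function from input text to a predicted label, and the toy examples and both stub models are hypothetical stand-ins. In practice they would be replaced with calls to FinMA-ES and your current LLM, and with your own labeled data.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Example = Tuple[str, str]  # (Spanish input text, gold label)
Predictor = Callable[[str], str]


def normalize(label: str) -> str:
    """Compare labels case-insensitively and ignore surrounding whitespace."""
    return label.strip().lower()


def accuracy(predict: Predictor, examples: Iterable[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    items: List[Example] = list(examples)
    if not items:
        raise ValueError("need at least one labeled example")
    correct = sum(normalize(predict(text)) == normalize(gold) for text, gold in items)
    return correct / len(items)


def compare(candidate: Predictor, baseline: Predictor,
            examples: Iterable[Example]) -> Dict[str, float]:
    """Score both models on the same examples and report the absolute delta."""
    items = list(examples)
    cand = accuracy(candidate, items)
    base = accuracy(baseline, items)
    return {"candidate": cand, "baseline": base, "delta": cand - base}


# Hypothetical toy data; replace with your own labeled Spanish finance set.
toy_examples = [
    ("Los beneficios del banco subieron un 12% en el trimestre.", "positive"),
    ("La empresa anunció pérdidas y recortes de plantilla.", "negative"),
    ("El consejo se reunirá el próximo martes.", "neutral"),
    ("La acción cayó tras la rebaja de calificación.", "negative"),
]


def baseline_stub(text: str) -> str:
    """Stand-in for the current LLM: always answers 'neutral'."""
    return "neutral"


def candidate_stub(text: str) -> str:
    """Stand-in for the tuned model: a keyword rule, for illustration only."""
    lowered = text.lower()
    if "subieron" in lowered:
        return "positive"
    if "pérdidas" in lowered or "cayó" in lowered:
        return "negative"
    return "neutral"


report = compare(candidate_stub, baseline_stub, toy_examples)
print(report)
```

Keeping the models behind a plain callable means the same harness can be reused for FLARE-ES subsets or for a retuned 7B model without changing the scoring code. Exact-match accuracy suits classification tasks; summarization tasks would need a different metric.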
Risks & Boundaries
Limitations
The model is capped at 7B parameters because of compute constraints, which may restrict its absolute capability.
Summarization tasks (FNS-2023 and others) show weak performance across most models.
When Not To Use
High-stakes automated trading decisions without human review.
As a drop-in replacement for large proprietary models on long-form summarization.
Failure Modes
Hallucinated or incorrect financial facts leading to wrong advice.
Poor label-sequence generation on complex summary or label tasks (ECTSum, FiNER-ORD).

