Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
A 7B finance-specialized LLM, paired with retrieval and tool pipelines, can deliver near-GPT-4 accuracy on many finance tasks at lower model cost and with controllable hallucination rates. That makes in-house deployment practical where data control and latency matter.
Summary TLDR
FinTral is a family of finance-focused multimodal LLMs built on Mistral-7B and trained on FinSet (20B deduplicated financial tokens). The authors instruction-tune the models, align them with AI feedback (dDPO/DPO), add vision via CLIP, and enable tools and retrieval. The best variant (FinTral-DPO-T&R) achieves strong zero-shot results across nine financial tasks (average score of about 0.70 on the evaluated tasks), sharply reduces hallucinations (Hallucination Index, HI, of 0.97), and competes with GPT-4 on several benchmarks while remaining a compact 7B model.
Problem Statement
Existing LLMs struggle with finance-specific language, numbers, and charts, and they are prone to hallucination. The paper targets a small (7B) multimodal model that handles financial text, tables, numbers, and images better while reducing hallucinations and keeping compute reasonable.
Main Contribution
FinTral: a family of multimodal finance LLMs built from Mistral-7B with domain pretraining, instruction tuning, AI-feedback alignment, vision, tools, and retrieval.
FinSet: a cleaned, deduplicated finance pretraining corpus (20B tokens) plus large instruction and visual instruction sets for financial tasks.
Empirical benchmark across 9 financial task groups (23+ datasets) showing FinTral variants outperform many open models and approach or match GPT-4 on several tasks.
Practical recipe: LoRA/QLoRA pretraining/finetuning, dDPO/DPO for alignment, CLIP vision encoder, BGE retrieval, and simple tools for math.
Key Findings
FinTral-DPO-T&R reaches an average score of 0.70 on evaluated text tasks with tools and retrieval.
Alignment with AI feedback (DPO/dDPO) raised average scores from 0.49 (FinTral-INST) to 0.59 (FinTral-DPO).
FinTral-DPO-T&R achieves a Hallucination Index (HI) of 0.97 on a multiple-choice test of financial term definitions, close to GPT-4-Turbo's 0.98 and above ChatGPT's 0.95.
FinTral-VL (vision-enabled) scores 0.63 on ChartQA and 0.75 on FinVQA; GPT-4V scores higher on both (0.79 and 0.89, respectively).
Results
Average score (text tasks)
Average score with tools+retrieval
Hallucination Index (HI)
Multimodal chart QA
Pretraining data size
Instruction tuning size
Visual instruction size
Pretraining compute
Who Should Care
What To Try In 7 Days
Run a small pilot: fine-tune a 7B model on a subset of your domain data and measure a FinTral-style Hallucination Index (HI) on your key financial terms.
Add a retrieval index (BGE or similar) over recent company filings and test question-answering with retrieval.
Add a simple tools layer (calculator functions) for numeric tasks and compare error rates against plain LLM outputs.
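The retrieval step in the pilot above can be prototyped before committing to an embedding model. The following is a minimal sketch, not the authors' pipeline: `embed` is a toy bag-of-words stand-in for a real encoder such as BGE, and the filing snippets, function names, and prompt layout are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model such as BGE."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages, k=2):
    """Prepend the retrieved passages to the question (hypothetical layout)."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical filing snippets, invented for illustration only.
FILINGS = [
    "Revenue for fiscal 2022 was 4.2 billion dollars",
    "The board approved a share buyback program",
    "Net income for fiscal 2022 was 310 million dollars",
]
```

Swapping `embed` for a real encoder leaves `retrieve` and `build_prompt` unchanged, which makes it easy to compare answer quality with and without retrieval on the same questions.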
Agent Features
Tool Use
- Offloads math to tools (Add/Subtract/Multiply style)
- Uses retrieval to fetch supporting documents
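Offloading arithmetic to tools can be sketched as a post-processing pass over model output. The summary only says the tools are "Add/Subtract/Multiply style", so the `Name(a, b)` call syntax, the regular expression, and `run_tools` below are assumptions made for illustration, not the paper's actual format.

```python
import re
from decimal import Decimal

# Hypothetical tool registry; names follow the Add/Subtract/Multiply style.
TOOLS = {
    "Add": lambda a, b: a + b,
    "Subtract": lambda a, b: a - b,
    "Multiply": lambda a, b: a * b,
}

# Assumed call syntax: ToolName(number, number). Nested calls are not handled.
CALL = re.compile(
    r"(Add|Subtract|Multiply)\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)

def run_tools(text):
    """Replace every tool call found in model output with its computed value."""
    def _evaluate(match):
        name, left, right = match.groups()
        # Decimal avoids binary floating-point surprises such as 0.1 + 0.2.
        return str(TOOLS[name](Decimal(left), Decimal(right)))
    return CALL.sub(_evaluate, text)
```

For example, `run_tools("Net change: Subtract(120.5, 100)")` rewrites the call in place with the exact difference, so the model never has to produce the digits itself.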
Optimization Features
Token Efficiency
- BPE tokenizer that segments numbers into single digits (helps numerical tasks)
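The effect of digit-level segmentation can be illustrated with a small pre-tokenization pass. This is only a sketch of the splitting behaviour: it is not a BPE tokenizer, and `split_digits` is a hypothetical helper.

```python
import re

def split_digits(text):
    """Pre-tokenize text so that every digit becomes its own token.

    Runs of non-digit, non-space characters stay together; a real BPE
    tokenizer would split those further into subword units.
    """
    return re.findall(r"\d|[^\d\s]+", text)
```

With this scheme "2023" always becomes the four tokens 2, 0, 2, 3, so the model sees a consistent representation for every number instead of arbitrary multi-digit chunks.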
Infra Optimization
- Pretraining reported on four 40GB A100 GPUs
Model Optimization
- LoRA
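As a reminder of what LoRA changes, the sketch below computes the effective weight W + (alpha / r) * B A from a frozen weight W and two small trainable matrices. It uses plain Python lists to stay dependency-free; the shapes, rank, and alpha values are illustrative assumptions, not the settings used for FinTral.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen weight, shape (d, k)
    a: trainable matrix A, shape (r, k)
    b: trainable matrix B, shape (d, r); zero at initialization in standard LoRA
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

Only A and B are trained, so the number of trainable parameters is r * (d + k) rather than d * k, which is what makes fine-tuning a 7B model feasible on a handful of GPUs.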
System Optimization
- Sequence length up to 8k tokens for long documents
Training Optimization
- dDPO / DPO alignment (no reward model)
- FlashAttention-2 for faster attention
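DPO needs no reward model because the preference signal enters the loss directly through log-probability ratios against a frozen reference model. The sketch below implements the standard DPO objective for a single preference pair; the summary does not say whether the authors' dDPO setup modifies it, and the default `beta=0.1` is an assumption.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favours the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference agree exactly, the loss is log 2; it drops as the policy assigns relatively more probability to the chosen response and rises when it favours the rejected one.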
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Domain-specific: optimized for finance and may underperform outside finance.
- Data cutoff Aug 1, 2023 limits up-to-date market facts.
- Energy and compute: pretraining and tuning require substantial GPU time (the pretraining step is reported at 80 hours on four A100 GPUs).
- Partial openness: authors plan responsible releases, but dataset/model release is not fully open yet.
When Not To Use
- For general-domain tasks outside finance where broad world knowledge is crucial.
- When you need guaranteed live market data that is newer than what the retrieval index covers and you have no live data feed.
- If strict open-source licensing and immediate dataset release are required.
Failure Modes
- Format-sensitive tasks (numerical tables, NER) can still suffer from instruction-following errors.
- Vision module lags behind top closed-source multimodal models on complex charts.
- Retrieval dependence: incorrect or stale index documents can lead to plausible but incorrect responses.
Core Entities
Models
- FinTral-INST
- FinTral-DPO
- FinTral-DPO-T&R
- FinTral-VL
- Mistral-7B-v0.1
- FinMA-7B
- LLaMa-7Bchat
- ChatGPT (gpt-3.5-turbo)
- GPT-4 (gpt-4-0613)
- GPT-4-Turbo (gpt-4-1106-preview)
Metrics
- Accuracy
- Average task score
- Entity-F1
- Exact Match
- Rouge-score
- Hallucination Index (HI)
Datasets
- FinSet
- FinVis-PT
- FinVis-IT
- FinVQA
- FinVQAv2
- ChartQA
- FinTerms-MCQ
- FinTerms-Gen
- FinanceBench (sample)
- FinQA
- ConvFinQA
- Finer-Ord / FiNER
- FiQA-SA
- FPB
- ECTSUM
- EDTSUM
- BigData22
- ACL18
- CIKM18
- GermanCredit
- AustralianCredit
Benchmarks
- FinSet benchmark (9 tasks, 23+ datasets)
- FinVQA / ChartQA (chart understanding)
- FinanceBench (open-book QA sample)

