FinTral: a 7B multimodal financial LLM + FinSet dataset that rivals GPT-4 on many finance tasks

February 16, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The work supplies an end-to-end recipe (pretraining data, LoRA/QLoRA, DPO alignment, CLIP vision, retrieval) and quantitative comparisons, but results focus on the authors' benchmark set and some closed-source comparisons.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/9

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, Muhammad Abdul-Mageed

Links

Abstract / PDF / Code

Why It Matters For Business

A 7B finance-specialized LLM, paired with retrieval and tool pipelines, can deliver near-GPT-4 accuracy on many finance tasks at lower model cost and with controllable hallucination rates, enabling in-house deployment where data control and latency matter.

Who Should Care

Summary TLDR

FinTral is a family of finance-focused multimodal LLMs built on Mistral-7B and trained on FinSet (20B deduplicated financial tokens). The authors instruction-tune the models, align them with AI feedback (dDPO/DPO), add vision via CLIP, and enable tools and retrieval. The best variant (FinTral-DPO-T&R) achieves strong zero-shot results across nine financial tasks (avg ~0.70 on evaluated tasks), sharply reduces hallucinations (Hallucination Index, HI, of 0.97), and competes with GPT-4 on several benchmarks while staying at the 7B scale typical of open-source models.

Problem Statement

Existing LLMs struggle with finance-specific language, numbers, charts, and hallucinations. The paper targets a small (7B) multimodal model that better handles text, tables, numbers, and images for finance while reducing hallucinations and keeping compute reasonable.

Main Contribution

FinTral: a family of multimodal finance LLMs built from Mistral-7B with domain pretraining, instruction tuning, AI-feedback alignment, vision, tools, and retrieval.

FinSet: a cleaned, deduplicated finance pretraining corpus (20B tokens) plus large instruction and visual instruction sets for financial tasks.

Key Findings

FinTral-DPO-T&R reaches an average score of 0.70 on evaluated text tasks with tools and retrieval.

Numbers: Avg 0.70 (Table 6)

Practical Use: Combine instruction-aligned 7B models with retrieval and simple tools to approach closed-source GPT-4 performance at much lower model size and cost.

Evidence Ref: Table 6

Alignment with AI feedback (DPO/dDPO) raised average scores from 0.49 (FinTral-INST) to 0.59 (FinTral-DPO).

Numbers: FinTral-INST 0.49 -> FinTral-DPO 0.59 (Table 5)

Practical Use: Use preference-style alignment (DPO) on domain instruction data to notably improve instruction following and task accuracy.

Evidence Ref: Table 5
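
DPO optimizes a preference loss directly on chosen/rejected response pairs, which is why the recipe needs no separate reward model. The following is a minimal sketch of the standard per-pair DPO loss, not the authors' training code. The function name, the `beta=0.1` default, and the idea of passing summed sequence log-probabilities as plain floats are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta=0.1 is an assumed
    default, not a value taken from the paper.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference, and rises above it when the policy favors the rejected one.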

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average score (text tasks) | FinTral-INST 0.49 | - | - | Table 5 (seven text clusters) | FinTral-INST average 0.49 across SA, NER, NU, TS, SMP, CS, FD | Table 5
Average score (text tasks) | FinTral-DPO 0.59 | ChatGPT 0.53 | +0.06 vs ChatGPT | Table 5 | FinTral-DPO average 0.59, outperforms ChatGPT on these tasks | Table 5

What To Try In 7 Days

Run a small pilot: fine-tune a 7B model on a subset of your domain data and evaluate a FinTral-style Hallucination Index (HI) on key terms.

Add a retrieval index (BGE or similar) over recent company filings and test question-answering with retrieval.

Instrument a simple tools layer (calculator functions) for numeric tasks and compare error rates vs plain LLM outputs.
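
The retrieval step in the list above can be prototyped before committing to an embedding model. The sketch below ranks passages by cosine similarity and builds a context-augmented prompt. It is an illustration under stated assumptions: `embed` is a toy word-count stand-in that a real pilot would replace with BGE or a similar encoder, and the function names and the filing snippets are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Prepend the retrieved passages to the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical filing snippets, invented purely for illustration.
filings = [
    "Acme Corp reported revenue of 120 million dollars in fiscal 2023.",
    "The board approved a share buyback program.",
    "Acme Corp operating expenses rose due to hiring.",
]
```

Calling `build_prompt("What revenue did Acme Corp report?", filings)` yields a prompt whose context starts with the revenue sentence. Comparing answers with and without that context gives a quick read on how much retrieval helps for a given question set.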

Agent Features

Tool Use
Offloads math to tools (Add/Subtract/Multiply style)
Uses retrieval to fetch supporting documents
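
One way to picture the math offloading is a post-processing pass that finds tool calls in the model's output and replaces them with computed values. The sketch below is an assumption about how such a layer might look, not the paper's implementation: the call syntax `Name(a, b)`, the `TOOLS` registry, and `run_tools` are hypothetical, and nested calls are not handled.

```python
import re
from decimal import Decimal

# Hypothetical registry mirroring the Add/Subtract/Multiply style of tools.
TOOLS = {
    "Add": lambda a, b: a + b,
    "Subtract": lambda a, b: a - b,
    "Multiply": lambda a, b: a * b,
}

# Matches calls such as Subtract(120, 95.5) with two plain decimal arguments.
NUMBER = r"-?\d+(?:\.\d+)?"
CALL = re.compile(rf"(Add|Subtract|Multiply)\(\s*({NUMBER})\s*,\s*({NUMBER})\s*\)")


def run_tools(text):
    """Replace each tool call in the text with its exactly computed result."""
    def evaluate(match):
        name, a, b = match.group(1), match.group(2), match.group(3)
        # Decimal avoids binary floating-point rounding on monetary amounts.
        result = TOOLS[name](Decimal(a), Decimal(b))
        return format(result, "f")
    return CALL.sub(evaluate, text)
```

For example, `run_tools("Net income is Subtract(120, 95.5) million.")` returns `"Net income is 24.5 million."`. Text without tool calls passes through unchanged, so the same function can wrap every model response.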

Optimization Features

Token Efficiency
BPE tokenizer that segments numbers into single digits (helps numerical tasks)
Infra Optimization
Pretraining reported on four 40GB A100 GPUs
Model Optimization
LoRA
System Optimization
Sequence length up to 8k tokens for long documents
Training Optimization
dDPO / DPO alignment (no reward model)
FlashAttention-2 for faster attention
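
The digit-level number segmentation listed under Token Efficiency can be illustrated with a small pre-tokenization sketch. This is not the model's tokenizer: a real BPE tokenizer would further split or merge the non-digit pieces. The function name `split_digits` and the regular expression are assumptions chosen only to show the effect on numbers.

```python
import re


def split_digits(text):
    """Split text so every digit is its own token.

    Runs of non-digit, non-space characters stay together; whitespace is
    dropped. A real BPE tokenizer would process the non-digit runs further.
    """
    return re.findall(r"\d|[^\d\s]+", text)
```

With this scheme, `split_digits("12.5%")` gives `["1", "2", ".", "5", "%"]`. Every number is spelled from the same ten digit tokens, so the model never meets an unseen multi-digit token, which is the property the summary credits with helping numerical tasks.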

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Domain-specific: optimized for finance and may underperform outside finance.

The data cutoff of Aug 1, 2023 limits knowledge of more recent market facts.

When Not To Use

For general-domain tasks outside finance where broad world knowledge is crucial.

When you need guaranteed live market data but have no live data feed to supply facts beyond the model's retrieval window.

Failure Modes

Format-sensitive tasks (numerical tables, NER) can still suffer from instruction-following errors.

Vision module lags behind top closed-source multimodal models on complex charts.

Core Entities

Models

FinTral-INST
FinTral-DPO
FinTral-DPO-T&R
FinTral-VL
Mistral-7B-v0.1
FinMA-7B
LLaMa-7Bchat
ChatGPT (gpt-3.5-turbo)
GPT-4 (gpt-4-0613)
GPT-4-Turbo (gpt-4-1106-preview)

Metrics

Accuracy
Average task score
Entity-F1
Exact Match
Rouge-score
Hallucination Index (HI)

Datasets

FinSet
FinVis-PT
FinVis-IT
FinVQA
FinVQAv2
ChartQA
FinTerms-MCQ
FinTerms-Gen
FinanceBench (sample)
FinQA
ConvFinQA
Finer-Ord / FiNER
FiQA-SA
FPB
ECTSUM
EDTSUM
BigData22
ACL18
CIKM18
GermanCredit
AustralianCredit

Benchmarks

FinSet benchmark (9 tasks, 23+ datasets)
FinVQA / ChartQA (chart understanding)
FinanceBench (open-book QA sample)