FinTral: a 7B multimodal financial LLM + FinSet dataset that rivals GPT-4 on many finance tasks

Overview

Decision Snapshot: Ready For Pilot

The work supplies an end-to-end recipe (pretraining data, LoRA/QLoRA, DPO alignment, CLIP vision, retrieval) and quantitative comparisons, but results focus on the authors' benchmark set and some closed-source comparisons.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/9

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, Muhammad Abdul-Mageed

Links

Abstract / PDF / Code

Why It Matters For Business

A 7B finance-specialized LLM, paired with retrieval and tool pipelines, can deliver near-GPT-4 accuracy on many finance tasks at lower model cost and with controllable hallucination rates, enabling in-house deployment where data control and latency matter.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

FinTral is a family of finance-focused multimodal LLMs built on Mistral-7B and trained on FinSet (20B deduplicated financial tokens). The authors instruction-tune the models, align them with AI feedback (dDPO/DPO), add vision via CLIP, and enable tools and retrieval. The best variant (FinTral-DPO-T&R) achieves strong zero-shot results across nine financial tasks (avg ~0.70 on evaluated tasks), sharply reduces hallucinations (HI 0.97), and competes with GPT-4 on several benchmarks while remaining a compact 7B model.

Problem Statement

Existing LLMs struggle with finance-specific language, numbers, charts, and hallucinations. The paper targets a small (7B) multimodal model that better handles text, tables, numbers, and images for finance while reducing hallucinations and keeping compute reasonable.

Main Contribution

FinTral: a family of multimodal finance LLMs built from Mistral-7B with domain pretraining, instruction tuning, AI-feedback alignment, vision, tools, and retrieval.

FinSet: a cleaned, deduplicated finance pretraining corpus (20B tokens) plus large instruction and visual instruction sets for financial tasks.

Key Findings

FinTral-DPO-T&R reaches an average score of 0.70 on evaluated text tasks with tools and retrieval.

Numbers: Avg 0.70 (Table 6)

Practical Use: Combine instruction-aligned 7B models with retrieval and simple tools to approach closed-source GPT-4 performance at much lower model size and cost.

Evidence Ref: Table 6

Alignment with AI feedback (DPO/dDPO) raised average scores from 0.49 (FinTral-INST) to 0.59 (FinTral-DPO).

Numbers: FinTral-INST 0.49 -> FinTral-DPO 0.59 (Table 5)

Practical Use: Use preference-style alignment (DPO) on domain instruction data to notably improve instruction following and task accuracy.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average score (text tasks)	FinTral-INST 0.49	—	—	Table 5 (seven text clusters)	FinTral-INST average 0.49 across SA, NER, NU, TS, SMP, CS, FD	Table 5
Average score (text tasks)	FinTral-DPO 0.59	ChatGPT 0.53	+0.06 vs ChatGPT	Table 5	FinTral-DPO average 0.59, outperforms ChatGPT on these tasks	Table 5

What To Try In 7 Days

Run a small pilot: fine-tune a 7B model on your domain subset and evaluate FinTral-style HI on key terms.

Add a retrieval index (BGE or similar) over recent company filings and test question-answering with retrieval.

Instrument a simple tools layer (calculator functions) for numeric tasks and compare error rates vs plain LLM outputs.
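The retrieval step above can be prototyped before committing to an embedding model. The following is a minimal sketch, not the authors' pipeline: it uses a toy bag-of-words cosine similarity as a stand-in for BGE (or similar) embeddings, and the helper names `embed`, `cosine`, and `top_k`, as well as the sample filing snippets, are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Hypothetical filing snippets, for illustration only.
filings = [
    "Acme Corp reported quarterly revenue growth driven by cloud services.",
    "Beta Bank disclosed higher credit loss provisions in its annual filing.",
    "Gamma Energy announced a dividend increase after strong cash flow.",
]

question = "Which company raised credit loss provisions?"
context = top_k(question, filings, k=1)

# The retrieved passage is prepended to the question before it is sent to the model.
prompt = f"Context: {context[0]}\nQuestion: {question}"
```

In a real pilot, `embed` would be replaced by calls to an embedding model and the linear scan by a vector index; the ranking-then-prompting flow stays the same.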

Agent Features

Tool Use

Offloads math to tools (Add/Subtract/Multiply style)

Uses retrieval to fetch supporting documents
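The math-offloading idea can be sketched as a small dispatcher that evaluates arithmetic outside the model. This is an illustration only: the call format (a string such as `Multiply(12.5, 4)`), the `TOOLS` table, and the `run_tool` helper are assumptions and are not taken from the paper.

```python
import re

# Hypothetical tool table; the names mirror the Add/Subtract/Multiply style.
TOOLS = {
    "Add": lambda a, b: a + b,
    "Subtract": lambda a, b: a - b,
    "Multiply": lambda a, b: a * b,
}

# Matches calls of the assumed form Name(number, number).
CALL_PATTERN = re.compile(
    r"^\s*(\w+)\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def run_tool(call):
    """Parse a call such as 'Multiply(12.5, 4)' and evaluate it in Python."""
    match = CALL_PATTERN.match(call)
    if match is None:
        raise ValueError(f"not a tool call: {call!r}")
    name, left, right = match.groups()
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](float(left), float(right))

result = run_tool("Multiply(12.5, 4)")  # 50.0
```

Because the arithmetic runs in ordinary code, any numeric error that remains comes from the model choosing the wrong operation or operands rather than from miscalculation, which makes the comparison against plain LLM outputs easier to interpret.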

Optimization Features

Token Efficiency

BPE tokenizer that segments numbers into single digits (helps numerical tasks)
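To make the single-digit segmentation concrete, the sketch below shows only the digit-splitting behavior. It is not the actual tokenizer: `split_digits` is a hypothetical helper, and a real BPE tokenizer would further split the non-digit runs into subword units.

```python
import re

def split_digits(text):
    """Split text so that every digit becomes its own piece.

    Runs of non-digit, non-space characters and runs of whitespace are
    kept whole in this simplified illustration.
    """
    return re.findall(r"\d|[^\d\s]+|\s+", text)

pieces = split_digits("Q3 2023")
# ['Q', '3', ' ', '2', '0', '2', '3']
```

With this scheme a number such as 2023 is always represented by the same four digit pieces, instead of by whichever multi-digit chunks a learned vocabulary happens to contain.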

Infra Optimization

Pretraining reported on four 40GB A100 GPUs

Model Optimization

LoRA
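As a reminder of what LoRA changes, the sketch below applies the standard formulation: the pretrained weight W stays frozen and a trainable low-rank product B·A, scaled by alpha / r, is added to it. The matrices, rank, and alpha value are made up for illustration and do not reflect the authors' configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(W, A, B, x, alpha):
    """Compute (W + (alpha / r) * B @ A) @ x for a column vector x.

    W: d_out x d_in frozen weight; A: r x d_in and B: d_out x r are trainable.
    """
    r = len(A)
    scale = alpha / r
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

# Illustrative 2x2 layer with rank r = 1 (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]            # r x d_in
B_zero = [[0.0], [0.0]]     # B starts at zero, so the adapter is initially a no-op
B_trained = [[2.0], [0.0]]  # a pretend post-training value
x = [[1.0], [3.0]]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)      # [[1.0], [3.0]]
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)  # [[9.0], [3.0]]
```

Only A and B are updated during fine-tuning, which is why the trainable parameter count scales with the rank r rather than with the full size of W.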

System Optimization

Sequence length up to 8k tokens for long documents

Training Optimization

dDPO / DPO alignment (no reward model)

FlashAttention-2 for faster attention
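The "no reward model" point follows from the form of the DPO objective, which scores a preference pair directly from the log-probabilities of the policy and a frozen reference model. The sketch below implements the standard per-pair DPO loss; the function name, the example log-probabilities, and the default `beta` are illustrative assumptions, and the details specific to dDPO (how the AI-feedback preference pairs are produced) are not covered.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    logp_*     : log-probabilities under the policy being trained
    ref_logp_* : log-probabilities under the frozen reference model
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: the loss equals log(2).
loss_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen answer: the loss drops.
loss_better = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, so the preference data itself plays the role a separate reward model would otherwise play.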

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/UBC-NLP/fintral

Risks & Boundaries

Limitations

Domain-specific: optimized for finance and may underperform outside finance.

Data cutoff Aug 1, 2023 limits up-to-date market facts.

When Not To Use

For general-domain tasks outside finance where broad world knowledge is crucial.

When you need guaranteed live market data beyond the model's retrieval window without a live data feed.

Failure Modes

Format-sensitive tasks (numerical tables, NER) can still suffer from instruction-following errors.

Vision module lags behind top closed-source multimodal models on complex charts.

Core Entities

Models

FinTral-INST, FinTral-DPO, FinTral-DPO-T&R, FinTral-VL, Mistral-7B-v0.1, FinMA-7B, LLaMa-7Bchat, ChatGPT (gpt-3.5-turbo), GPT-4 (gpt-4-0613), GPT-4-Turbo (gpt-4-1106-preview)

Metrics

Accuracy, Average task score, Entity-F1, Exact Match, Rouge-score, Hallucination Index (HI)

Datasets

FinSet, FinVis-PT, FinVis-IT, FinVQA, FinVQAv2, ChartQA, FinTerms-MCQ, FinTerms-Gen, FinanceBench (sample), FinQA, ConvFinQA, Finer-Ord / FiNER, FiQA-SA, FPB, ECTSUM, EDTSUM, BigData22, ACL18, CIKM18, GermanCredit, AustralianCredit

Benchmarks

FinSet benchmark (9 tasks, 23+ datasets)

FinVQA / ChartQA (chart understanding)

FinanceBench (open-book QA sample)
