FinTral: a 7B multimodal financial LLM + FinSet dataset that rivals GPT-4 on many finance tasks

February 16, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, Muhammad Abdul-Mageed

Links

Abstract / PDF

Why It Matters For Business

A 7B finance-specialized LLM, paired with retrieval and tool pipelines, can deliver near-GPT-4 accuracy on many finance tasks at lower model cost and with low, measurable hallucination rates. This makes in-house deployment practical where data control and latency matter.

Summary TLDR

FinTral is a family of finance-focused multimodal LLMs built on Mistral-7B and pretrained on FinSet (20B deduplicated financial tokens). The authors instruction-tune the models, align them with AI feedback (dDPO/DPO), add vision via a CLIP encoder, and enable tools and retrieval. The best variant (FinTral-DPO-T&R) achieves strong zero-shot results across nine financial tasks (average ~0.70 on the evaluated tasks), sharply reduces hallucinations (Hallucination Index, HI, of 0.97), and competes with GPT-4 on several benchmarks while remaining a compact 7B model.

Problem Statement

Existing LLMs struggle with finance-specific language, numbers, charts, and hallucinations. The paper targets a small (7B) multimodal model that better handles text, tables, numbers, and images for finance while reducing hallucinations and keeping compute reasonable.

Main Contribution

FinTral: a family of multimodal finance LLMs built from Mistral-7B with domain pretraining, instruction tuning, AI-feedback alignment, vision, tools, and retrieval.

FinSet: a cleaned, deduplicated finance pretraining corpus (20B tokens) plus large instruction and visual instruction sets for financial tasks.

Empirical benchmark across 9 financial task groups (23+ datasets) showing FinTral variants outperform many open models and approach or match GPT-4 on several tasks.

Practical recipe: LoRA/QLoRA pretraining/finetuning, dDPO/DPO for alignment, CLIP vision encoder, BGE retrieval, and simple tools for math.
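To make the alignment step in the recipe more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It assumes you already have summed sequence log-probabilities from the policy and from a frozen reference model. The function name, argument names, and the default beta are illustrative and not taken from the paper, and the sketch does not cover how the authors' dDPO variant differs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    beta=0.1 is an assumed default, not a value reported for FinTral.
    """
    # How much more the policy likes each answer than the reference does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# No preference yet: loss equals log(2), about 0.693
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Policy favors the chosen answer more than the reference does: loss drops below log(2)
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss is driven only by log-probability ratios against the reference model, which is why no separate reward model is needed.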

Key Findings

FinTral-DPO-T&R reaches an average score of 0.70 on evaluated text tasks with tools and retrieval.

Numbers: Avg 0.70 (Table 6)

Alignment with AI feedback (DPO/dDPO) raised average scores from 0.49 (FinTral-INST) to 0.59 (FinTral-DPO).

Numbers: FinTral-INST 0.49 -> FinTral-DPO 0.59 (Table 5)

FinTral-DPO-T&R achieves a Hallucination Index (HI) of 0.97 on multiple-choice questions about financial term definitions, close to GPT-4-Turbo's 0.98 and above ChatGPT's 0.95.

Numbers: HI 0.97 vs GPT-4-Turbo 0.98, ChatGPT 0.95 (Table 8)
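For readers who want to track a similar number on their own term list, the sketch below scores multiple-choice answers. It assumes the index can be approximated as the fraction of questions where the model picks the correct definition, so that higher is better. That reading is an assumption made for illustration; the paper's exact HI definition may differ, and the function name is hypothetical.

```python
def hallucination_index(predictions, gold):
    """Fraction of multiple-choice answers that match the gold choice.

    Assumed proxy for HI; the paper's exact definition may differ.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    # Compare option letters case-insensitively, ignoring stray whitespace
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)

# Three of four made-up answers match the gold option
print(hallucination_index(["A", "b", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
```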

FinTral-VL (vision-enabled) scores 0.63 on ChartQA and 0.75 on FinVQA; GPT-4V scores higher on both (0.79 and 0.89).

Numbers: FinTral-VL ChartQA 0.63, FinVQA 0.75; GPT-4V 0.79/0.89 (Table 7)

Results

Average score (text tasks)

Value: FinTral-INST 0.49

Average score (text tasks)

Value: FinTral-DPO 0.59

Baseline: ChatGPT 0.53

Average score with tools+retrieval

Value: FinTral-DPO-T&R 0.70

Baseline: GPT-4-Turbo 0.72

Hallucination Index (HI)

Value: FinTral-DPO-T&R 0.97

Baseline: GPT-4-Turbo 0.98

Multimodal chart QA

Value: FinTral-VL CU avg 0.69 (ChartQA 0.63 / FinVQA 0.75)

Baseline: GPT-4V CU avg 0.84 (ChartQA 0.79 / FinVQA 0.89)

Pretraining data size

Value: 20.0B deduplicated tokens

Instruction tuning size

Value: 226.3k instructions (post-dedup)

Visual instruction size

Value: ~1.1M multimodal alignment/instruction examples

Pretraining compute

Value: 80 hours on four 40GB A100 GPUs

Who Should Care

What To Try In 7 Days

Run a small pilot: fine-tune a 7B model on a subset of your domain data and measure a FinTral-style Hallucination Index (HI) on your key domain terms.

Add a retrieval index (BGE or similar) over recent company filings and test question-answering with retrieval.

Add a simple tools layer (calculator functions) for numeric tasks and compare error rates against plain LLM outputs.
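The retrieval step of such a pilot can be prototyped before any embedding model is installed. The sketch below ranks documents against a question and builds a context-augmented prompt. It uses a toy bag-of-words cosine similarity as a stand-in for a dense embedding model such as BGE, so it illustrates the flow only and is not the paper's retriever. The helper names, the prompt layout, and the sample filing sentences are all made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real pilot would use a dense embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Prepend retrieved passages to the question (hypothetical prompt layout)."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Made-up filing snippets, for illustration only
docs = [
    "Acme Corp reported revenue of 120 million in Q2.",
    "The board approved a share buyback program.",
    "Weather was mild in the region this spring.",
]
print(build_prompt("What revenue did Acme report in Q2?", docs, k=1))
```

Swapping `embed` and `cosine` for a real embedding model and a vector index leaves `retrieve` and `build_prompt` unchanged, which keeps the pilot easy to upgrade.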

Agent Features

Tool Use

  • Offloads math to tools (Add/Subtract/Multiply style)
  • Uses retrieval to fetch supporting documents
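As a rough illustration of offloading math, the sketch below scans model output for calls such as `Add(2, 3)` and replaces them with computed results, so the arithmetic is never left to the model. The exact call syntax, the regular expression, and the `run_tools` name are assumptions made for this example. Only the three operations named above are included.

```python
import operator
import re

# Tool names follow the Add/Subtract/Multiply style; the call syntax is assumed
TOOLS = {"Add": operator.add, "Subtract": operator.sub, "Multiply": operator.mul}
CALL = re.compile(
    r"\b(Add|Subtract|Multiply)\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)

def run_tools(text):
    """Replace innermost tool calls with their results until none remain."""
    def evaluate(match):
        name, left, right = match.groups()
        result = TOOLS[name](float(left), float(right))
        # Render whole numbers without a trailing ".0"
        return str(int(result)) if result == int(result) else str(result)

    previous = None
    while previous != text:  # repeat so nested calls resolve from the inside out
        previous = text
        text = CALL.sub(evaluate, text)
    return text

print(run_tools("Net income is Subtract(120, 45) million."))
# Net income is 75 million.
print(run_tools("Multiply(Add(2, 3), 4)"))
# 20
```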

Optimization Features

Token Efficiency

  • BPE tokenizer that segments numbers into single digits (helps numerical tasks)
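Digit-level segmentation can be pictured with a small pre-tokenization sketch: every digit becomes its own piece, while words stay whole and punctuation is split off. This is not the model's actual BPE tokenizer, only an illustration of why a number is represented the same way regardless of its magnitude. The function name and the regular expression are assumptions.

```python
import re

def split_digits(text):
    """Pre-tokenize so each digit is its own piece; letter runs stay whole."""
    # \d        -> one digit at a time
    # [^\W\d]+  -> runs of letters/underscores (word characters that are not digits)
    # [^\w\s]   -> single punctuation or symbol characters
    return re.findall(r"\d|[^\W\d]+|[^\w\s]", text)

print(split_digits("Revenue rose 12.5% to $1,204"))
# ['Revenue', 'rose', '1', '2', '.', '5', '%', 'to', '$', '1', ',', '2', '0', '4']
```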

Infra Optimization

  • Pretraining reported on four 40GB A100 GPUs

Model Optimization

  • LoRA
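The appeal of LoRA for a 7B model is easiest to see in a small sketch. The frozen weight W is left untouched and only a low-rank pair (A, B) is trained, giving y = W x + (alpha / r) B A x. The code below is a plain-Python illustration with hypothetical helper names. The 4096 x 4096 shape, rank 8, and alpha values are example numbers, not settings reported for FinTral.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / rank) * B (A x), with W kept frozen."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

print(lora_param_counts(4096, 4096, 8))  # (16777216, 65536)

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank 1
B_zero = [[0.0], [0.0]]     # B starts at zero, so the adapter is a no-op at first
print(lora_forward([2.0, 3.0], W, A, B_zero, alpha=16))  # [2.0, 3.0]
```

In this example the adapter trains 65,536 parameters instead of 16,777,216 for the same matrix, which is about 0.4% of the full count.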

System Optimization

  • Sequence length up to 8k tokens for long documents

Training Optimization

  • dDPO / DPO alignment (no reward model)
  • FlashAttention-2 for faster attention

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Domain-specific: optimized for finance and may underperform outside finance.
  • Data cutoff Aug 1, 2023 limits up-to-date market facts.
  • Energy and compute: pretraining and tuning require substantial GPU time (the reported pretraining step took 80 hours on four 40GB A100 GPUs).
  • Partial openness: authors plan responsible releases, but dataset/model release is not fully open yet.

When Not To Use

  • For general-domain tasks outside finance where broad world knowledge is crucial.
  • When you need guaranteed live market data that goes beyond what the retrieval index covers and you have no live data feed.
  • If strict open-source licensing and immediate dataset release are required.

Failure Modes

  • Format-sensitive tasks (numerical tables, NER) can still suffer from instruction-following errors.
  • Vision module lags behind top closed-source multimodal models on complex charts.
  • Retrieval dependence: incorrect or stale index documents can lead to plausible but incorrect responses.

Core Entities

Models

  • FinTral-INST
  • FinTral-DPO
  • FinTral-DPO-T&R
  • FinTral-VL
  • Mistral-7B-v0.1
  • FinMA-7B
  • LLaMa-7Bchat
  • ChatGPT (gpt-3.5-turbo)
  • GPT-4 (gpt-4-0613)
  • GPT-4-Turbo (gpt-4-1106-preview)

Metrics

  • Accuracy
  • Average task score
  • Entity-F1
  • Exact Match
  • ROUGE score
  • Hallucination Index (HI)

Datasets

  • FinSet
  • FinVis-PT
  • FinVis-IT
  • FinVQA
  • FinVQAv2
  • ChartQA
  • FinTerms-MCQ
  • FinTerms-Gen
  • FinanceBench (sample)
  • FinQA
  • ConvFinQA
  • Finer-Ord / FiNER
  • FiQA-SA
  • FPB
  • ECTSUM
  • EDTSUM
  • BigData22
  • ACL18
  • CIKM18
  • GermanCredit
  • AustralianCredit

Benchmarks

  • FinSet benchmark (9 tasks, 23+ datasets)
  • FinVQA / ChartQA (chart understanding)
  • FinanceBench (open-book QA sample)