Open-source multimodal financial LLMs trained on 52B tokens with instruction and chart/table tuning

Overview

Decision Snapshot: Needs Validation

The paper provides broad empirical evidence across many datasets and clear training recipes, but claims of beating closed-source models are limited to selected tasks and 8B models; practitioners should validate on their own data before production use.

Citations: 4

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Jimin Huang, Mengxi Xiao, Dong Li, Zihao Jiang, Yuzhe Yang, Yifei Zhang, Lingfei Qian, Yan Wang, Xueqing Peng, Yang Ren, Ruoyu Xiang, Zhengyu Chen, Xiao Zhang, Yueru He, Weiguang Han, Shunian Chen, Lihang Shen, Daniel Kim, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, Guojun Xiong, Zhiwei Liu, Zheheng Luo, Zhiyuan Yao, Ruey-Ling Weng, Meikang Qiu, Kaleb E Smith, Honghai Yu, Yanzhao Lai, Min Peng, Jian-Yun Nie, Jordan W. Suchow, Xiao-Yang Liu, Benyou Wang, Alejandro Lopez-Lira, Qianqian Xie, Sophia Ananiadou, Junichi Tsujii

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Models trained on large finance-specific corpora plus multimodal tuning make practical tasks—report parsing, numeric QA, and chart/table extraction—work better out of the box for analysts and automation.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

This paper introduces Open-FinLLMs: FinLLaMA (continually pretrained on a 52B-token finance corpus), FinLLaMA-Instruct (finetuned on 573K financial instructions), and FinLLaVA (multimodally tuned on 1.43M image/table/chart pairs). The authors open-source code, data, and models and report broad gains on 14 financial task types (30 datasets) and 4 multimodal tasks. Key wins: stronger zero-/few-shot financial NLP, better numeric reasoning than general LLMs, and state-of-the-art open-source chart/table understanding (TableBench = 72.4). All models are 8B-parameter LLaMA3 derivatives; limitations include English-only evaluation and no models larger than 8B.

Problem Statement

General LLMs lack deep financial knowledge and handle non-text financial data (tables, time series, charts) poorly. Prior financial models used small domain corpora or remained text-only, leaving zero-/few-shot performance, multimodal reasoning, and decision-making underexplored. Open-FinLLMs aims to fill that gap by combining large-scale continual pretraining, instruction tuning, and multimodal alignment for finance.

Main Contribution

FinLLaMA: continual pretraining of LLaMA3-8B on a 52 billion token finance-focused corpus (text, tables, time series).

FinLLaMA-Instruct: instruction finetuning with 573K curated financial instructions to boost domain task performance.

Key Findings

Large finance-focused continual pretraining improves zero/few-shot task performance.

Numbers: FinLLaMA zero-shot TSA sentiment 81 vs LLaMA3-8B 75 (Table 5)

Practical Use: If you need better out-of-the-box finance NLP, start from a backbone continually pretrained on finance data rather than a general LLM.

Evidence Ref: Table 5

Instruction tuning with a large math/finance instruction mix improves numeric understanding.

Numbers: FinLLaMA-Instruct numeric understanding accuracy 0.69 vs GPT-4 0.63 (Table 7)

Practical Use: To improve numeric QA over financial documents, include large-scale math and finance instruction tuning like the 573K dataset used here.

Evidence Ref: Table 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sentiment F1 (zero-shot)	81 (FinLLaMA on TSA)	75 (LLaMA3-8B)	+6	TSA	Table 5 reports FinLLaMA TSA=81, LLaMA3-8B=75	Table 5
Accuracy	0.69 (FinLLaMA-Instruct)	0.63 (GPT-4)	+0.06	Number understanding (ConvFinQA/FinQA aggregated)	Table 7 lists NU: FinLLaMA-Instruct=0.69, GPT-4=0.63	Table 7

What To Try In 7 Days

Download FinLLaMA-Instruct and test it on your financial QA and numeric-extraction prompts.

Feed a few example tables/charts to FinLLaVA to validate OCR + table extraction on your reports.

Run a quick few-shot comparison: your current model vs FinLLaMA on 5 representative tasks (sentiment, NER, numeric QA).
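The quick comparison suggested above can be sketched as a small offline harness. This is a minimal sketch, not the paper's evaluation code: the `compare` function, the two stand-in predictors, and the toy sentiment examples are all hypothetical, and you would replace the predictors with calls to your current model and to FinLLaMA.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled example: (input text, gold label). Use your own data.
Example = Tuple[str, str]


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over the labels that occur in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare(
    models: Dict[str, Callable[[str], str]],
    tasks: Dict[str, List[Example]],
) -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Run every model on every task; return {task: {model: (accuracy, macro_f1)}}."""
    report: Dict[str, Dict[str, Tuple[float, float]]] = {}
    for task, examples in tasks.items():
        gold = [label for _, label in examples]
        report[task] = {}
        for name, predict in models.items():
            pred = [predict(text) for text, _ in examples]
            report[task][name] = (accuracy(gold, pred), macro_f1(gold, pred))
    return report


# Stand-in predictors so the sketch runs offline; swap in real model calls.
def baseline_predict(text: str) -> str:
    return "positive"


def candidate_predict(text: str) -> str:
    return "negative" if "miss" in text else "positive"


tasks = {
    "sentiment": [
        ("Revenue beat guidance", "positive"),
        ("Earnings miss widened", "negative"),
        ("Margins expanded", "positive"),
        ("Company will miss targets", "negative"),
    ]
}
report = compare({"baseline": baseline_predict, "candidate": candidate_predict}, tasks)
for task, rows in report.items():
    for name, (acc, f1) in rows.items():
        print(f"{task:10s} {name:10s} acc={acc:.2f} macro_f1={f1:.2f}")
```

Keeping the predictors as plain callables means the same harness works whether a model is served locally or behind an API; a handful of examples per task is only a smoke test, not a benchmark.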

Agent Features

Memory

FinMem agent (memory module used in trading evaluation)

Tool Use

DeepSpeed, LoRA, AutoTrain, Data-Juicer

Frameworks

LLaVA-1.5 training framework

Architectures

LLaMA3-8B backbone

CLIP vision encoder + two-layer MLP projector for multimodal alignment
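To make the projector idea concrete, here is a minimal pure-Python sketch of a two-layer MLP that maps vision patch features into the LLM embedding space. It is illustrative only: the GELU activation between the two layers follows the common LLaVA-1.5 convention and is an assumption here, and the class name `MLPProjector`, the toy dimensions, and the random initialization are hypothetical. A real model would use a tensor library and much larger dimensions.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map y = W x + b, with W stored as rows of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x: Vector) -> Vector:
    """Exact GELU, x * Phi(x), using the Gaussian CDF written with erf."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; the 0.02 scale is an arbitrary illustrative choice."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class MLPProjector:
    """Two linear layers with a GELU in between (activation is an assumption)."""

    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = init_matrix(llm_dim, vision_dim, rng)
        self.b1 = [0.0] * llm_dim
        self.w2 = init_matrix(llm_dim, llm_dim, rng)
        self.b2 = [0.0] * llm_dim

    def __call__(self, patch_features: List[Vector]) -> List[Vector]:
        """Project each patch feature independently into the LLM embedding size."""
        return [
            linear(gelu(linear(p, self.w1, self.b1)), self.w2, self.b2)
            for p in patch_features
        ]


# Toy sizes so the sketch runs instantly; real encoders and LLMs are far wider.
rng = random.Random(1)
patches = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
projector = MLPProjector(vision_dim=8, llm_dim=16)
visual_tokens = projector(patches)
print(len(visual_tokens), len(visual_tokens[0]))
```

The projected vectors have the same width as the LLM's token embeddings, so they can be placed in the input sequence alongside text tokens.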

Optimization Features

Token Efficiency

Max sequence length 8192 tokens

Pretraining chunk size 8192 tokens
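One common way to produce fixed-size pretraining chunks is to concatenate tokenized documents and slice the stream, sketched below. Only the 8192-token chunk size comes from the summary above; the packing strategy, the EOS separator, the dropped remainder, and the function name `pack_into_chunks` are assumptions for illustration.

```python
from typing import Iterable, Iterator, List

CHUNK_SIZE = 8192  # pretraining chunk size reported above


def pack_into_chunks(
    docs: Iterable[List[int]],
    chunk_size: int = CHUNK_SIZE,
    eos_id: int = 0,  # hypothetical end-of-sequence token id
) -> Iterator[List[int]]:
    """Concatenate tokenized documents (EOS-separated) and yield fixed-size chunks.

    The trailing partial chunk is dropped, which is one possible policy;
    padding it would be another.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]


# Three fake "documents" of token ids: 17,003 tokens in total with separators.
docs = [list(range(1, 5001)), list(range(1, 5001)), list(range(1, 7001))]
chunks = list(pack_into_chunks(docs))
print(len(chunks), [len(c) for c in chunks])
```

Packing avoids padding waste when documents are much shorter or longer than the context window, at the cost of chunks that can span document boundaries.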

Infra Optimization

DeepSpeed on 64 A100 80GB for CPT

Training on 8 A100 80GB (instruction), 8 HGX H20 (multimodal)

Model Optimization

INT4 quantization for instruction finetuning

LoRA
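A back-of-the-envelope calculation shows why INT4 weights plus LoRA reduce finetuning cost. This is rough arithmetic under stated assumptions, not a measurement from the paper: the hidden size of 4096, the LoRA rank of 16, and the round figure of 8 billion parameters are assumed values, and the memory estimate ignores quantization metadata, activations, and optimizer state.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


hidden = 4096      # assumed hidden size of an 8B LLaMA-style model
rank = 16          # hypothetical LoRA rank; not stated in the summary
num_params = 8e9   # round figure for an "8B" model

full = hidden * hidden
lora = lora_trainable_params(hidden, hidden, rank)
print(f"full square projection: {full:,} params")
print(f"LoRA adapter (r={rank}): {lora:,} params ({lora / full:.2%} of full)")
print(f"weights in 16-bit: {weight_memory_gb(num_params, 16):.1f} GB")
print(f"weights in INT4:   {weight_memory_gb(num_params, 4):.1f} GB")
```

Under these assumptions each adapted projection trains under 1% of the parameters it would need with full finetuning, and the frozen base weights take roughly a quarter of their 16-bit footprint.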

System Optimization

bf16/tf32 precision for multimodal stages

MLP projector to map vision features into LLM embeddings

Training Optimization

Continual pretraining mixing ratio ~3:1 finance:general

SFT

Cosine LR schedule with warm-up
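The two training choices listed here, a roughly 3:1 finance-to-general data mix and a cosine learning-rate schedule with warm-up, can be sketched as follows. This is a minimal sketch under assumptions: linear warm-up, the peak learning rate, the step counts, and both function names are placeholders, since the summary gives only the mixing ratio and the schedule type.

```python
import math
import random


def cosine_lr_with_warmup(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Linear warm-up to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_source(
    rng: random.Random,
    finance_weight: float = 3.0,
    general_weight: float = 1.0,
) -> str:
    """Pick which corpus the next training example comes from, weighted ~3:1."""
    return rng.choices(
        ["finance", "general"], weights=[finance_weight, general_weight]
    )[0]


# Placeholder hyperparameters, chosen only to make the shape of the curve visible.
TOTAL, WARMUP, PEAK = 1000, 100, 2e-5
for s in (0, 50, 99, 100, 550, 1000):
    print(s, f"{cosine_lr_with_warmup(s, TOTAL, WARMUP, PEAK):.2e}")

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
finance_share = draws.count("finance") / len(draws)
print(f"finance share: {finance_share:.3f}")
```

Mixing some general-domain data into continual pretraining is a common way to limit forgetting of general capabilities; sampling per example, as done here, is one of several ways to realize the ratio.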

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: MIT

Code URLs

https://huggingface.co/collections/TheFinAI/open-finllms-66b671f2b4958a65e20decbe

Data URLs

https://huggingface.co/collections/TheFinAI/open-finllms-66b671f2b4958a65e20decbe

Risks & Boundaries

Limitations

Models are only 8B parameters; larger-scale behavior is untested.

Evaluations are English-only; multilingual performance unknown.

When Not To Use

For automated high-stakes financial advice without human oversight.

When multilingual or non-English coverage is required.

Failure Modes

Hallucinations in numeric or regulatory claims when source data is absent.

OCR or table-parsing errors on low-resolution images.

Core Entities

Models

FinLLaMA, FinLLaMA-Instruct, FinLLaVA, LLaMA3-8B, LLaMA3.1-8B, BloombergGPT, GPT-4, GPT-4o, Gemini-1.5-pro, LLaVA-1.5, Palmyra-Fin-70B-32K

Metrics

F1, Accuracy, Exact Match (EM), Sharpe Ratio, Cumulative Return, Max Drawdown, Annual Volatility
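The trading metrics in this list have standard textbook definitions, sketched below from a series of simple daily returns. The paper may compute them differently (for example with log returns or a non-zero risk-free rate); the 252-trading-day annualization and the sample return series are assumptions for illustration.

```python
import math
from statistics import mean, stdev
from typing import List

TRADING_DAYS = 252  # assumed annualization factor for daily data


def cumulative_return(returns: List[float]) -> float:
    """Total compounded return over the period."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def annual_volatility(returns: List[float]) -> float:
    """Sample standard deviation of daily returns, annualized."""
    return stdev(returns) * math.sqrt(TRADING_DAYS)


def sharpe_ratio(returns: List[float], risk_free_daily: float = 0.0) -> float:
    """Annualized mean excess return divided by its standard deviation."""
    excess = [r - risk_free_daily for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(TRADING_DAYS)


def max_drawdown(returns: List[float]) -> float:
    """Largest peak-to-trough loss of the equity curve, as a positive fraction."""
    value = peak = 1.0
    worst = 0.0
    for r in returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Made-up daily returns, purely to exercise the functions.
daily = [0.10, -0.50, 0.20]
print(f"cumulative return: {cumulative_return(daily):+.2%}")
print(f"max drawdown:      {max_drawdown(daily):.2%}")
print(f"annual volatility: {annual_volatility(daily):.2f}")
print(f"sharpe ratio:      {sharpe_ratio(daily):.2f}")
```

Reporting drawdown and volatility alongside return matters because a strategy with a high cumulative return can still be unusable if its worst loss along the way is too deep.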

Datasets

52B-token continual pretraining corpus (papers, calls, reports, indicators, news, historical, SEC); FinLLaMA instruction set (573K); Multimodal instruction set (1.43M); ChartBench; TableBench; MMMU; FinFact; ConvFinQA; FiQA-SA; FPB; NER; FinBEN

Benchmarks

ChartBench, TableBench, MMMU, FinBEN, FinTral comparisons
