Open-source multimodal financial LLMs trained on 52B tokens with instruction and chart/table tuning

August 20, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides broad empirical evidence across many datasets and clear training recipes, but claims of beating closed-source models are limited to selected tasks and 8B models; practitioners should validate on their own data before production use.

Citations: 4

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Jimin Huang, Mengxi Xiao, Dong Li, Zihao Jiang, Yuzhe Yang, Yifei Zhang, Lingfei Qian, Yan Wang, Xueqing Peng, Yang Ren, Ruoyu Xiang, Zhengyu Chen, Xiao Zhang, Yueru He, Weiguang Han, Shunian Chen, Lihang Shen, Daniel Kim, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, Guojun Xiong, Zhiwei Liu, Zheheng Luo, Zhiyuan Yao, Ruey-Ling Weng, Meikang Qiu, Kaleb E Smith, Honghai Yu, Yanzhao Lai, Min Peng, Jian-Yun Nie, Jordan W. Suchow, Xiao-Yang Liu, Benyou Wang, Alejandro Lopez-Lira, Qianqian Xie, Sophia Ananiadou, Junichi Tsujii

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Models trained on large finance-specific corpora plus multimodal tuning make practical tasks—report parsing, numeric QA, and chart/table extraction—work better out of the box for analysts and automation.

Who Should Care

Summary TLDR

This paper introduces Open-FinLLMs, a family of three models: FinLLaMA (continually pretrained on a 52B-token finance corpus), FinLLaMA-Instruct (fine-tuned on 573K financial instructions), and FinLLaVA (multimodally tuned on 1.43M image/table/chart pairs). The authors open-source code, data, and models and report broad gains on 14 financial task types (30 datasets) and 4 multimodal tasks. Key wins: stronger zero-/few-shot financial NLP, better numeric reasoning than general LLMs, and state-of-the-art open-source chart/table understanding (TableBench=72.4). The models are 8B-parameter LLaMA3 derivatives; limitations include English-only evaluation and a model size capped at 8B.

Problem Statement

General LLMs lack deep financial knowledge and handle non-text financial data (tables, time series, charts) poorly. Prior financial models used small domain corpora or remained text-only, leaving zero-/few-shot performance, multimodal reasoning, and decision-making underexplored. Open-FinLLMs aims to fill that gap by combining large-scale continual pretraining, instruction tuning, and multimodal alignment for finance.

Main Contribution

FinLLaMA: continual pretraining of LLaMA3-8B on a 52 billion token finance-focused corpus (text, tables, time series).

FinLLaMA-Instruct: instruction finetuning with 573K curated financial instructions to boost domain task performance.

Key Findings

Large finance-focused continual pretraining improves zero/few-shot task performance.

Numbers: FinLLaMA zero-shot TSA sentiment 81 vs LLaMA3-8B 75 (Table 5)

Practical Use: If you need better out-of-the-box finance NLP, start from a backbone that has been continually pretrained on finance data rather than from a general LLM.

Evidence Ref: Table 5

Instruction tuning with a large math/finance instruction mix improves numeric understanding.

Numbers: FinLLaMA-Instruct numeric understanding accuracy 0.69 vs GPT-4 0.63 (Table 7)

Practical Use: To improve numeric QA over financial documents, include large-scale math and finance instruction tuning like the 573K dataset used here.

Evidence Ref: Table 7

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Sentiment F1 (zero-shot) | 81 (FinLLaMA on TSA) | 75 (LLaMA3-8B) | +6 | TSA | Table 5 reports FinLLaMA TSA=81, LLaMA3-8B=75 | Table 5
Accuracy | 0.69 (FinLLaMA-Instruct) | 0.63 (GPT-4) | +0.06 | Number understanding (ConvFinQA/FinQA aggregated) | Table 7 lists NU: FinLLaMA-Instruct=0.69, GPT-4=0.63 | Table 7
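The deltas in the results above can be rechecked, and put on a common relative scale, with a few lines of Python. This is only an illustrative sketch: the two value/baseline pairs are the ones reported in the results, while the relative-improvement column is derived here and is not a figure from the paper.

```python
# Recompute absolute and relative deltas from the reported value/baseline pairs.
rows = [
    # (metric, model value, baseline value) as reported in the results
    ("Sentiment F1, TSA (zero-shot)", 81.0, 75.0),
    ("Number understanding accuracy", 0.69, 0.63),
]

def deltas(value, baseline):
    """Return (absolute delta, relative delta in percent), both rounded."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 1)

results = {name: deltas(value, baseline) for name, value, baseline in rows}
for name, (abs_delta, rel_delta) in results.items():
    print(f"{name}: {abs_delta:+} absolute, {rel_delta:+}% relative")
```

Note that the two rows use different scales (0-100 vs 0-1), so only the relative column is directly comparable across them.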

What To Try In 7 Days

Download FinLLaMA-Instruct and test it on your financial QA and numeric-extraction prompts.

Feed a few example tables/charts to FinLLaVA to validate OCR + table extraction on your reports.

Run a quick few-shot comparison: your current model vs FinLLaMA on 5 representative tasks (sentiment, NER, numeric QA).
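A quick comparison like the one suggested above could be organized as in the following minimal sketch. Everything here is hypothetical scaffolding: `current_model` and `candidate_model` are stand-in functions, not real model calls, and the three example prompts are made up. In practice each stand-in would be replaced by a call to your existing model and to FinLLaMA, and the examples by your own labeled data.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]      # (prompt, expected label)
Model = Callable[[str], str]   # prompt -> predicted label

def accuracy(model: Model, examples: List[Example]) -> float:
    """Fraction of examples whose normalized prediction equals the label."""
    if not examples:
        return 0.0
    hits = sum(model(p).strip().lower() == y.strip().lower() for p, y in examples)
    return hits / len(examples)

def compare(models: Dict[str, Model],
            tasks: Dict[str, List[Example]]) -> Dict[str, Dict[str, float]]:
    """Return {task: {model_name: accuracy}} for every task/model pair."""
    return {task: {name: accuracy(m, ex) for name, m in models.items()}
            for task, ex in tasks.items()}

# Hypothetical stand-ins; swap in real inference calls.
def current_model(prompt: str) -> str:
    return "neutral"

def candidate_model(prompt: str) -> str:
    if "beat" in prompt:
        return "positive"
    return "negative" if "miss" in prompt else "neutral"

tasks = {
    "sentiment": [  # made-up examples; use your own labeled data
        ("Q3 revenue beat guidance.", "positive"),
        ("The company will miss its debt covenant.", "negative"),
        ("The board meets on Tuesday.", "neutral"),
    ],
}
report = compare({"current": current_model, "candidate": candidate_model}, tasks)
print(report)
```

Exact-match accuracy is used only because it is the simplest metric to wire up; for sentiment or NER, F1 would match the paper's reporting more closely.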

Agent Features

Memory
FinMem agent (memory module used in trading evaluation)
Tool Use
DeepSpeed, LoRA, AutoTrain, Data-Juicer
Frameworks
LLaVA-1.5 training framework
Architectures
LLaMA3-8B backbone; CLIP vision encoder + two-layer MLP projector for multimodal alignment
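To make the projector idea concrete, here is a toy, dependency-free sketch of a two-layer MLP that maps vision patch features into an LLM-sized embedding space. It is not the authors' implementation: the GELU activation, the random initialization, and the tiny dimensions (`VISION_DIM`, `LLM_DIM`, `NUM_PATCHES`) are all assumptions chosen for illustration, and a real system would use a tensor library and trained weights.

```python
import math
import random

def gelu(x: float) -> float:
    """Exact GELU via the error function (activation choice is an assumption)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight has shape [out_dim][in_dim]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Random uniform weights and zero bias, for demonstration only."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project(patch_features, layer1, layer2):
    """Map each patch feature to an LLM-sized vector: Linear -> GELU -> Linear."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4   # toy sizes, not the real model's
layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
tokens = project(patches, layer1, layer2)
print(len(tokens), len(tokens[0]))
```

Each output vector has the LLM's embedding width, so the projected patches can be placed in the input sequence alongside text token embeddings.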

Optimization Features

Token Efficiency
Max sequence length 8192 tokens; pretraining chunk size 8192 tokens
Infra Optimization
DeepSpeed on 64 A100 80GB for CPT; training on 8 A100 80GB (instruction), 8 HGX H20 (multimodal)
Model Optimization
INT4 quantization for instruction finetuning; LoRA
System Optimization
bf16/tf32 precision for multimodal stages; MLP projector to map vision features into LLM embeddings
Training Optimization
Continual pretraining mixing ratio ~3:1 finance:general; SFT; cosine LR schedule with warm-up
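The cosine learning-rate schedule with warm-up listed under Training Optimization can be sketched as below. This is one common formulation (linear warm-up, then cosine decay), not necessarily the exact variant the authors used, and the step counts and peak learning rate in the example are placeholder values rather than the paper's hyperparameters.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder hyperparameters, for illustration only.
schedule = [cosine_lr(s, total_steps=1000, warmup_steps=100, peak_lr=2e-5)
            for s in range(1001)]
print(schedule[0], schedule[100], schedule[550], schedule[1000])
```

The warm-up phase avoids large updates while optimizer statistics are still unreliable, and the cosine tail lowers the rate smoothly toward the end of training.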

Risks & Boundaries

Limitations

Models are only 8B parameters; larger-scale behavior is untested.

Evaluations are English-only; multilingual performance unknown.

When Not To Use

For automated high-stakes financial advice without human oversight.

When multilingual or non-English coverage is required.

Failure Modes

Hallucinations in numeric or regulatory claims when source data is absent.

OCR or table-parsing errors on low-resolution images.

Core Entities

Models

FinLLaMA, FinLLaMA-Instruct, FinLLaVA, LLaMA3-8B, LLaMA3.1-8B, BloombergGPT, GPT-4, GPT-4o, Gemini-1.5-pro, LLaVA-1.5, Palmyra-Fin-70B-32K

Metrics

F1, Accuracy, Exact Match (EM), Sharpe Ratio, Cumulative Return, Max Drawdown, Annual Volatility
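For readers less familiar with the trading metrics in this list, the sketch below shows one conventional way to compute them from a series of per-period returns. The definitions are the standard textbook ones and may differ in detail from the paper's evaluation code; the 252-day annualization factor, the zero risk-free rate, and the sample returns are assumptions made for illustration.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor for daily returns

def cumulative_return(returns):
    """Compound per-period returns into a total return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

def annual_volatility(returns, periods=TRADING_DAYS):
    """Sample standard deviation of returns, scaled to a yearly horizon."""
    return stdev(returns) * math.sqrt(periods)

def sharpe_ratio(returns, risk_free_per_period=0.0, periods=TRADING_DAYS):
    """Annualized mean excess return divided by its standard deviation."""
    excess = [r - risk_free_per_period for r in returns]
    sd = stdev(excess)
    return 0.0 if sd == 0 else mean(excess) / sd * math.sqrt(periods)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

daily = [0.01, -0.02, 0.015, 0.005, -0.01]  # made-up daily returns
print(cumulative_return(daily), annual_volatility(daily),
      sharpe_ratio(daily), max_drawdown(daily))
```

Sharpe ratio and cumulative return reward performance, while max drawdown and annual volatility capture risk, so trading evaluations typically report them together.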

Datasets

52B-token continual pretraining corpus (papers, calls, reports, indicators, news, historical, SEC); FinLLaMA instruction set (573K); Multimodal instruction set (1.43M); ChartBench; TableBench; MMMU; FinFact; ConvFinQA; FiQA-SA; FPB; NER; FinBEN

Benchmarks

ChartBench, TableBench, MMMU, FinBEN, FinTral comparisons