Open-source multimodal financial LLMs trained on 52B tokens with instruction and chart/table tuning

August 20, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

4

Authors

Jimin Huang, Mengxi Xiao, Dong Li, Zihao Jiang, Yuzhe Yang, Yifei Zhang, Lingfei Qian, Yan Wang, Xueqing Peng, Yang Ren, Ruoyu Xiang, Zhengyu Chen, Xiao Zhang, Yueru He, Weiguang Han, Shunian Chen, Lihang Shen, Daniel Kim, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, Guojun Xiong, Zhiwei Liu, Zheheng Luo, Zhiyuan Yao, Ruey-Ling Weng, Meikang Qiu, Kaleb E Smith, Honghai Yu, Yanzhao Lai, Min Peng, Jian-Yun Nie, Jordan W. Suchow, Xiao-Yang Liu, Benyou Wang, Alejandro Lopez-Lira, Qianqian Xie, Sophia Ananiadou, Junichi Tsujii

Links

Abstract / PDF

Why It Matters For Business

Models trained on large finance-specific corpora and then tuned on multimodal data handle practical tasks (report parsing, numeric QA, and chart/table extraction) better out of the box, which helps both analysts and automation pipelines.

Summary TLDR

This paper introduces Open-FinLLMs, a family of three models: FinLLaMA (continually pretrained on a 52B-token finance corpus), FinLLaMA-Instruct (finetuned on 573K financial instructions), and FinLLaVA (tuned on 1.43M image/table/chart pairs for multimodal input). The authors open-source code, data, and models and report broad gains on 14 financial task types (30 datasets) and 4 multimodal tasks. Key wins: stronger zero-shot and few-shot financial NLP, better numeric reasoning than general LLMs, and state-of-the-art open-source chart/table understanding (TableBench: 72.4). The models are 8B-parameter LLaMA3 derivatives; limitations include English-only evaluation and a model size capped at 8B.

Problem Statement

General LLMs lack deep financial knowledge and handle non-text financial data (tables, time series, charts) poorly. Prior financial models used small domain corpora or remained text-only, leaving zero-shot and few-shot performance, multimodal reasoning, and decision-making underexplored. Open-FinLLMs aims to fill that gap by combining large-scale continual pretraining, instruction tuning, and multimodal alignment for finance.

Main Contribution

FinLLaMA: continual pretraining of LLaMA3-8B on a 52 billion token finance-focused corpus (text, tables, time series).

FinLLaMA-Instruct: instruction finetuning with 573K curated financial instructions to boost domain task performance.

FinLLaVA: multimodal finetuning using 1.43M image/table/chart pairs and an image-to-LLM projector for chart and table reasoning.

Extensive evaluation across 14 task categories, 30 datasets, and new multimodal test sets (ChartBench, TableBench); release of code, data, and models under open licenses.

Key Findings

Large finance-focused continual pretraining improves zero/few-shot task performance.

Numbers: FinLLaMA zero-shot TSA sentiment 81 vs LLaMA3-8B 75 (Table 5)

Instruction tuning with a large math/finance instruction mix improves numeric understanding.

Numbers: FinLLaMA-Instruct numeric understanding accuracy 0.69 vs GPT-4 0.63 (Table 7)

Multimodal tuning yields state-of-the-art open-source table understanding and strong chart performance.

Numbers: FinLLaVA TableBench 72.4 vs GPT-4o 66.7 and Gemini-1.5-pro 58.2 (Table 6)

Results

Sentiment F1 (zero-shot)

Value: 81 (FinLLaMA on TSA)

Baseline: 75 (LLaMA3-8B)

Accuracy (numeric understanding)

Value: 0.69 (FinLLaMA-Instruct)

Baseline: 0.63 (GPT-4)

Accuracy (TableBench)

Value: 72.4 (FinLLaVA)

Baseline: 66.7 (GPT-4o); 58.2 (Gemini-1.5-pro)

Overall trading cumulative return

Value: 0.3265 (FinLLaMA overall)

Baseline: -0.0942 (LLaMA3-8B overall)

Overall Sharpe Ratio (trading)

Value: 1.4088 (FinLLaMA overall)

Baseline: -0.2334 (LLaMA3-8B overall)
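For readers less familiar with the two trading metrics above, here is a minimal sketch of how cumulative return and Sharpe ratio are commonly computed from a series of daily returns. This summary does not state the paper's exact conventions, so the use of simple (not log) returns, a zero risk-free rate, and 252 trading days per year are all assumptions, and the sample returns are made up.

```python
import math
import statistics


def cumulative_return(daily_returns):
    """Compound simple daily returns into one total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample
    standard deviation, scaled by sqrt(periods_per_year).
    The zero risk-free rate and 252-day year are assumptions."""
    excess = [r - risk_free_daily for r in daily_returns]
    std = statistics.stdev(excess)
    if std == 0.0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return statistics.mean(excess) / std * math.sqrt(periods_per_year)


# Hypothetical daily returns, for illustration only.
toy_returns = [0.01, -0.005, 0.007, 0.002, -0.003]
print(f"cumulative return: {cumulative_return(toy_returns):.4f}")
print(f"sharpe ratio:      {sharpe_ratio(toy_returns):.4f}")
```

Under these conventions a negative Sharpe ratio, like the LLaMA3-8B baseline above, simply means the mean excess return over the period was negative.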

Who Should Care

What To Try In 7 Days

Download FinLLaMA-Instruct and test it on your financial QA and numeric-extraction prompts.

Feed a few example tables/charts to FinLLaVA to validate OCR + table extraction on your reports.

Run a quick few-shot comparison: your current model vs FinLLaMA on 5 representative tasks (sentiment, NER, numeric QA).
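The comparison in the last step can be run with a very small harness. The sketch below is one possible shape for it, not the authors' evaluation code: each model is any function that maps a prompt to a label, and the two stub models and three example sentences are hypothetical placeholders to be replaced with calls to your current model and to FinLLaMA.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]     # (prompt, gold label)
Model = Callable[[str], str]  # prompt in, predicted label out


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def macro_f1(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen."""
    labels = sorted(set(golds) | set(preds))
    scores = []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare_models(models: Dict[str, Model],
                   examples: List[Example]) -> Dict[str, Dict[str, float]]:
    """Run every model on the same examples and report both metrics."""
    golds = [gold for _, gold in examples]
    report = {}
    for name, model in models.items():
        preds = [model(prompt) for prompt, _ in examples]
        report[name] = {"accuracy": accuracy(preds, golds),
                        "macro_f1": macro_f1(preds, golds)}
    return report


# Hypothetical stand-ins; swap in real model calls.
def always_neutral(prompt: str) -> str:
    return "neutral"


def keyword_rule(prompt: str) -> str:
    if "surged" in prompt:
        return "positive"
    if "fell" in prompt:
        return "negative"
    return "neutral"


examples = [
    ("Shares surged after the earnings beat.", "positive"),
    ("Revenue fell short of guidance.", "negative"),
    ("The board meets on Tuesday.", "neutral"),
]
results = compare_models(
    {"always_neutral": always_neutral, "keyword_rule": keyword_rule}, examples
)
for name, metrics in results.items():
    print(name, metrics)
```

Keeping the model behind a plain callable means the same harness works whether the model runs locally or behind an API, and the same prompts (including any few-shot examples) reach both models.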

Agent Features

Memory

  • FinMem agent (memory module used in trading evaluation)

Tool Use

  • DeepSpeed
  • LoRA
  • AutoTrain
  • Data-Juicer

Frameworks

  • LLaVA-1.5 training framework

Architectures

  • LLaMA3-8B backbone
  • CLIP vision encoder + two-layer MLP projector for multimodal alignment
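To make the projector's role concrete, here is a dependency-free sketch of a two-layer MLP that maps vision patch features to the LLM embedding width. It is an illustration of the idea only: the toy dimensions, random weights, and the GELU activation (the choice used by LLaVA-1.5's default projector) are assumptions, since this summary gives no projector details beyond "two-layer MLP".

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU via the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Random weights (out_dim x in_dim) and zero bias, for illustration."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def project_patches(patches, layer1, layer2):
    """Linear -> GELU -> Linear, applied to each patch independently."""
    visual_tokens = []
    for patch in patches:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        visual_tokens.append(linear(hidden, *layer2))
    return visual_tokens


# Toy sizes (assumed); real encoders and LLMs are far wider.
VISION_DIM, LLM_DIM, NUM_PATCHES = 6, 8, 4
rng = random.Random(0)
layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
           for _ in range(NUM_PATCHES)]

visual_tokens = project_patches(patches, layer1, layer2)
print(len(visual_tokens), "visual tokens of width", len(visual_tokens[0]))
```

The output has one vector per image patch, each as wide as a text token embedding, so the LLM can consume the projected patches alongside ordinary text tokens.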

Optimization Features

Token Efficiency

  • Max sequence length 8192 tokens
  • Pretraining chunk size 8192 tokens
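A fixed pretraining chunk size means the tokenized corpus is cut into blocks of at most 8192 tokens. The sketch below shows the simplest version of that step. How the authors handle document boundaries, packing, or the final short remainder is not stated here, so the `drop_last` option is a hypothetical knob rather than their setting.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(token_ids: Sequence[int],
                 chunk_size: int = 8192,
                 drop_last: bool = False) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping chunks of token ids.

    The final chunk may be shorter than chunk_size; set drop_last=True
    to discard it (an assumed option, not taken from the paper).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        chunk = list(token_ids[start:start + chunk_size])
        if drop_last and len(chunk) < chunk_size:
            break
        yield chunk


# Stand-in for a tokenized document of 20,000 tokens.
fake_tokens = list(range(20_000))
print([len(c) for c in chunk_tokens(fake_tokens)])
```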

Infra Optimization

  • DeepSpeed on 64 A100 80GB for CPT
  • Training on 8 A100 80GB (instruction), 8 HGX H20 (multimodal)

Model Optimization

  • INT4 quantization for instruction finetuning
  • LoRA
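The summary lists LoRA and INT4 quantization but gives no rank or target modules. As a reminder of why LoRA shrinks the trainable footprint, the standard low-rank update and its parameter count are sketched below. The rank r = 16 and the 4096 x 4096 matrix shape in the worked count are illustrative assumptions, not values reported here.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}

\text{trainable parameters per adapted matrix: } r\,(d + k)
\quad \text{vs.} \quad d\,k \text{ for full finetuning}

d = k = 4096,\; r = 16: \quad
16 \times 8192 = 131{,}072
\quad \text{vs.} \quad 4096^2 = 16{,}777{,}216 \;\; (\approx 0.78\%)
```

Only A and B receive gradients; the base weight W_0 stays frozen, which is what makes it practical to keep it in a low-precision format during finetuning.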

System Optimization

  • bf16/tf32 precision for multimodal stages
  • MLP projector to map vision features into LLM embeddings

Training Optimization

  • Continual pretraining mixing ratio ~3:1 finance:general
  • SFT
  • Cosine LR schedule with warm-up
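The cosine schedule with warm-up named above can be written in a few lines. This is a generic sketch of the technique: the linear warm-up shape and every number in the usage lines (peak learning rate, step counts) are assumptions, since the summary does not report the authors' hyperparameters.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical settings, for illustration only.
TOTAL, WARMUP, PEAK = 1000, 100, 1e-4
for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at_step(step, TOTAL, WARMUP, PEAK):.2e}")
```

The warm-up avoids large early updates while optimizer statistics are still noisy, and the cosine tail lowers the rate smoothly instead of in discrete drops.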

Reproducibility

License

  • MIT

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Models are only 8B parameters; larger-scale behavior is untested.
  • Evaluations are English-only; multilingual performance unknown.
  • Multimodal scope is charts and tables only; other modalities not covered.
  • Some datasets and filtering steps rely on GPT-4/GPT-4o, which can introduce selection bias.

When Not To Use

  • For automated high-stakes financial advice without human oversight.
  • When multilingual or non-English coverage is required.
  • If you need modalities beyond text, tables, charts (e.g., audio or raw market feeds).

Failure Modes

  • Hallucinations in numeric or regulatory claims when source data is absent.
  • OCR or table-parsing errors on low-resolution images.
  • Overfitting to the curated finance corpus leading to blind spots in niche subdomains.
  • Agent brittleness in trading when market regimes shift outside training period.

Core Entities

Models

  • FinLLaMA
  • FinLLaMA-Instruct
  • FinLLaVA
  • LLaMA3-8B
  • LLaMA3.1-8B
  • BloombergGPT
  • GPT-4
  • GPT-4o
  • Gemini-1.5-pro
  • LLaVA-1.5
  • Palmyra-Fin-70B-32K

Metrics

  • F1
  • Accuracy
  • Exact Match (EM)
  • Sharpe Ratio
  • Cumulative Return
  • Max Drawdown
  • Annual Volatility

Datasets

  • 52B-token continual pretraining corpus (papers, calls, reports, indicators, news, historical, SEC)
  • FinLLaMA instruction set (573K)
  • Multimodal instruction set (1.43M)
  • ChartBench
  • TableBench
  • MMMU
  • FinFact
  • ConvFinQA
  • FiQA-SA
  • FPB
  • NER
  • FinBEN

Benchmarks

  • ChartBench
  • TableBench
  • MMMU
  • FinBEN
  • FinTral comparisons