Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
4
Why It Matters For Business
Models that combine large-scale finance-specific pretraining with multimodal tuning handle practical tasks (report parsing, numeric QA, and chart/table extraction) better out of the box, which benefits both analysts and automation pipelines.
Summary TLDR
This paper introduces Open-FinLLMs, a family of three models: FinLLaMA (continually pretrained on a 52B-token finance corpus), FinLLaMA-Instruct (finetuned on 573K financial instructions), and FinLLaVA (multimodally tuned on 1.43M image/table/chart pairs). The authors open-source the code, data, and models and report broad gains across 14 financial task types (30 datasets) and 4 multimodal tasks. Key wins: stronger zero-/few-shot financial NLP, better numeric reasoning than general LLMs, and state-of-the-art open-source chart/table understanding (TableBench=72.4). The models are 8B-parameter LLaMA3 derivatives; limitations include English-only evaluation and a model size capped at 8B.
Problem Statement
General LLMs lack deep financial knowledge and handle non-text financial data (tables, time series, charts) poorly. Prior financial models used small domain corpora or remained text-only, leaving zero-/few-shot performance, multimodal reasoning, and decision-making underexplored. Open-FinLLMs aims to fill that gap by combining large-scale continual pretraining, instruction tuning, and multimodal alignment for finance.
Main Contribution
FinLLaMA: continual pretraining of LLaMA3-8B on a 52 billion token finance-focused corpus (text, tables, time series).
FinLLaMA-Instruct: instruction finetuning with 573K curated financial instructions to boost domain task performance.
FinLLaVA: multimodal finetuning using 1.43M image/table/chart pairs and an image-to-LLM projector for chart and table reasoning.
Extensive evaluation across 14 task categories, 30 datasets, and new multimodal test sets (ChartBench, TableBench); release of code, data, and models under open licenses.
Key Findings
Large finance-focused continual pretraining improves zero/few-shot task performance.
Instruction tuning with a large math/finance instruction mix improves numeric understanding.
Multimodal tuning yields state-of-the-art open-source table understanding and strong chart performance.
Results
Metrics reported: sentiment F1 (zero-shot), accuracy (two results), overall trading cumulative return, and overall trading Sharpe Ratio.
Who Should Care
What To Try In 7 Days
Download FinLLaMA-Instruct and test it on your financial QA and numeric-extraction prompts.
Feed a few example tables/charts to FinLLaVA to validate OCR and table extraction on your reports.
Run a quick few-shot comparison of your current model against FinLLaMA on 5 representative tasks (e.g., sentiment, NER, numeric QA).
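The comparison in the last step can be scored with a very small harness. The sketch below is only an illustration: `compare`, the label set, and the idea of wrapping each model as a plain `prompt -> label` callable are assumptions made here, not part of the Open-FinLLMs release. Plug in whatever inference call you already use for each model.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, gold label)


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of predictions that match the gold label exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over the labels present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare(
    models: Dict[str, Callable[[str], str]],
    examples: List[Example],
) -> Dict[str, Dict[str, float]]:
    """Run every model on the same examples and score it.

    `models` maps a display name to a hypothetical callable that takes a
    prompt and returns a label string; outputs are stripped and lowercased
    before comparison with the gold labels.
    """
    gold = [label for _, label in examples]
    report = {}
    for name, predict in models.items():
        pred = [predict(prompt).strip().lower() for prompt, _ in examples]
        report[name] = {
            "accuracy": accuracy(gold, pred),
            "macro_f1": macro_f1(gold, pred),
        }
    return report


if __name__ == "__main__":
    # Toy sentiment examples; replace with your own task data.
    data = [
        ("Shares surged after earnings beat.", "positive"),
        ("The company missed revenue guidance.", "negative"),
        ("The board meets on Tuesday.", "neutral"),
    ]
    baseline = lambda prompt: "neutral"  # stand-in for your current model
    print(compare({"baseline": baseline}, data))
```

Using the same examples and the same normalization for both models keeps the comparison fair; for generative tasks such as numeric QA you would swap the label match for an exact-match check on the extracted answer.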
Agent Features
Memory
- FinMem agent (memory module used in trading evaluation)
Tool Use
- DeepSpeed
- LoRA
- AutoTrain
- Data-Juicer
Frameworks
- LLaVA-1.5 training framework
Architectures
- LLaMA3-8B backbone
- CLIP vision encoder + two-layer MLP projector for multimodal alignment
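To make the projector idea concrete, here is a dependency-free sketch of a two-layer MLP that maps per-patch vision features into the LLM embedding width. It assumes a Linear, GELU, Linear layout (the layout used by LLaVA-1.5; the summary itself only says "two-layer MLP"), and the toy dimensions and the `project_patches` and `random_matrix` helpers are hypothetical. A real implementation would use a tensor library and trained weights.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: weight[out_index][in_index]


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map y = W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_patches(
    patches: List[Vector], w1: Matrix, b1: Vector, w2: Matrix, b2: Vector
) -> List[Vector]:
    """Map each vision patch feature to one LLM-width embedding."""
    projected = []
    for patch in patches:
        hidden = [gelu(h) for h in linear(patch, w1, b1)]
        projected.append(linear(hidden, w2, b2))
    return projected


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights, for demonstration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    vision_dim, hidden_dim, llm_dim = 4, 8, 6  # toy sizes, not the real ones
    w1, b1 = random_matrix(hidden_dim, vision_dim, rng), [0.0] * hidden_dim
    w2, b2 = random_matrix(llm_dim, hidden_dim, rng), [0.0] * llm_dim
    patches = [[rng.uniform(-1, 1) for _ in range(vision_dim)] for _ in range(3)]
    tokens = project_patches(patches, w1, b1, w2, b2)
    print(len(tokens), len(tokens[0]))  # one projected vector per patch
```

The projected vectors play the role of extra "tokens" that are placed alongside the text embeddings before the language model processes the sequence.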
Optimization Features
Token Efficiency
- Max sequence length 8192 tokens
- Pretraining chunk size 8192 tokens
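As a rough illustration of fixed-size chunking, the sketch below splits a stream of token ids into 8192-token pieces. How the authors pack or concatenate documents is not described in this summary, so the `chunk_tokens` helper and its `drop_last` option are assumptions for illustration only.

```python
from typing import Iterable, Iterator, List


def chunk_tokens(
    token_ids: Iterable[int],
    chunk_size: int = 8192,
    drop_last: bool = False,
) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `chunk_size` token ids.

    The final chunk may be shorter; set `drop_last=True` to discard it
    so that every chunk has exactly `chunk_size` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[int] = []
    for token in token_ids:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer and not drop_last:
        yield buffer


if __name__ == "__main__":
    # 20,000 fake token ids stand in for a tokenized document stream.
    lengths = [len(chunk) for chunk in chunk_tokens(range(20_000))]
    print(lengths)
```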
Infra Optimization
- DeepSpeed on 64 A100 80GB for CPT
- Training on 8 A100 80GB (instruction), 8 HGX H20 (multimodal)
Model Optimization
- INT4 quantization for instruction finetuning
- LoRA
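The appeal of combining INT4 weights with LoRA is easiest to see as arithmetic. The sketch below counts the parameters a rank-r LoRA adapter adds to one weight matrix and estimates raw 4-bit weight storage. The rank of 16 is a hypothetical value chosen for illustration (the summary does not state the rank used), and the storage estimate ignores quantization constants and any layers kept in higher precision.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by LoRA to one d_out x d_in matrix.

    LoRA learns an update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense matrix itself."""
    return d_in * d_out


def int4_weight_bytes(n_params: int) -> int:
    """Raw storage for 4-bit weights: two parameters per byte, rounded up."""
    return (n_params + 1) // 2


if __name__ == "__main__":
    # 4096 is the hidden size of LLaMA3-8B; rank 16 is an illustrative choice.
    d, rank = 4096, 16
    added = lora_added_params(d, d, rank)
    print(added, full_params(d, d), added / full_params(d, d))
    # Approximate weight storage for an 8B-parameter model at 4 bits.
    print(int4_weight_bytes(8_000_000_000) / 1e9, "GB")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, while the frozen base weights occupy roughly a quarter of their 16-bit size.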
System Optimization
- bf16/tf32 precision for multimodal stages
- MLP projector to map vision features into LLM embeddings
Training Optimization
- Continual pretraining mixing ratio ~3:1 finance:general
- SFT
- Cosine LR schedule with warm-up
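Two of the items above reduce to short formulas. The sketch below shows one common form of a cosine learning-rate schedule with linear warm-up, plus the arithmetic behind a finance:general mixing ratio. The peak rate, warm-up length, and step counts are hypothetical values; the summary does not give the actual hyperparameters, and the authors' schedule may differ in detail.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step`: linear warm-up, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def general_tokens_for_ratio(finance_tokens: float, ratio: float = 3.0) -> float:
    """General-domain tokens needed for a `ratio`:1 finance:general mix."""
    return finance_tokens / ratio


if __name__ == "__main__":
    # Illustrative settings only.
    for step in (0, 99, 100, 550, 1000):
        print(step, cosine_lr(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4))
    # With a 52B-token finance corpus, a 3:1 mix implies roughly this many
    # general-domain tokens (billions).
    print(general_tokens_for_ratio(52.0))
```

Mixing in general-domain data during continual pretraining is a common way to limit forgetting of general capabilities while the model absorbs domain text.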
Reproducibility
License
- MIT
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Models are only 8B parameters; larger-scale behavior is untested.
- Evaluations are English-only; multilingual performance unknown.
- Multimodal scope is charts and tables only; other modalities not covered.
- Some datasets and filtering steps rely on GPT-4/GPT-4o, which can introduce selection bias.
When Not To Use
- For automated high-stakes financial advice without human oversight.
- When multilingual or non-English coverage is required.
- If you need modalities beyond text, tables, charts (e.g., audio or raw market feeds).
Failure Modes
- Hallucinations in numeric or regulatory claims when source data is absent.
- OCR or table-parsing errors on low-resolution images.
- Overfitting to the curated finance corpus, leading to blind spots in niche subdomains.
- Agent brittleness in trading when market regimes shift outside training period.
Core Entities
Models
- FinLLaMA
- FinLLaMA-Instruct
- FinLLaVA
- LLaMA3-8B
- LLaMA3.1-8B
- BloombergGPT
- GPT-4
- GPT-4o
- Gemini-1.5-pro
- LLaVA-1.5
- Palmyra-Fin-70B-32K
Metrics
- F1
- Accuracy
- Exact Match (EM)
- Sharpe Ratio
- Cumulative Return
- Max Drawdown
- Annual Volatility
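For readers less familiar with the trading metrics listed above, the sketch below computes them from a series of per-period simple returns. The conventions used here (252 trading periods per year, sample standard deviation, zero risk-free rate by default) are common defaults and are assumptions; the summary does not state which conventions the paper's trading evaluation uses.

```python
import math
from typing import List


def cumulative_return(returns: List[float]) -> float:
    """Total compounded return over the whole series."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def annual_volatility(returns: List[float], periods_per_year: int = 252) -> float:
    """Sample standard deviation of returns, scaled to a yearly horizon."""
    n = len(returns)
    if n < 2:
        return 0.0
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return math.sqrt(variance) * math.sqrt(periods_per_year)


def sharpe_ratio(
    returns: List[float],
    risk_free_rate: float = 0.0,
    periods_per_year: int = 252,
) -> float:
    """Annualized mean excess return divided by annualized volatility."""
    per_period_rf = risk_free_rate / periods_per_year
    excess = [r - per_period_rf for r in returns]
    vol = annual_volatility(excess, periods_per_year)
    if vol == 0.0:
        return 0.0
    return (sum(excess) / len(excess)) * periods_per_year / vol


def max_drawdown(returns: List[float]) -> float:
    """Largest peak-to-trough loss of the equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


if __name__ == "__main__":
    daily = [0.10, -0.50, 0.20]  # exaggerated toy returns
    print(cumulative_return(daily), max_drawdown(daily))
    print(annual_volatility(daily), sharpe_ratio(daily))
```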
Datasets
- 52B-token continual pretraining corpus (papers, calls, reports, indicators, news, historical data, SEC filings)
- FinLLaMA instruction set (573K)
- Multimodal instruction set (1.43M)
- ChartBench
- TableBench
- MMMU
- FinFact
- ConvFinQA
- FiQA-SA
- FPB
- NER
- FinBEN
Benchmarks
- ChartBench
- TableBench
- MMMU
- FinBEN
- FinTral comparisons

