Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
24
Why It Matters For Business
A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.
Summary TLDR
InvestLM is a financial-domain LLM built by instruction-tuning LLaMA-65B on a small, manually curated set of 1,335 finance-focused instructions drawn from CFA material, SEC filings, textbooks, StackExchange, journals, and similar sources. The authors train with LoRA and extend the context window to 8,192 tokens. InvestLM improves on the untuned LLaMA in 8 of 9 financial NLP tasks, with large average gains for the 7B model (+138.4%) and moderate gains for the 65B model (+28.2%). Six finance experts judge its answers comparable to or better than GPT-3.5 and GPT-4, though it trails Claude-2 in some comparisons. The model parameters are released under LLaMA terms.
Problem Statement
Closed commercial finance LLMs (e.g., BloombergGPT) block open research. Smaller public finance-tuned models generalize poorly. Can a small, high-quality instruction set turn a strong foundation model into a useful open financial assistant?
Main Contribution
Build InvestLM by instruction-tuning LLaMA-65B on a manually curated 1,335-example finance instruction set covering CFA, SEC filings, textbooks, journals, StackExchange, and crafted investment Q&A.
Train with LoRA (rank 16) and apply linear RoPE scaling to extend the context to 8,192 tokens, enabling long-document finance tasks.
Show InvestLM improves financial NLP benchmarks vs. LLaMA and achieves expert-rated responses comparable to GPT-3.5/GPT-4; release model parameters under LLaMA license.
Key Findings
Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.
Smaller models benefit more from domain instruction tuning than larger models.
Expert judges rate InvestLM comparable to or better than GPT-3.5 and GPT-4 on investment Q&A.
Adding a large general-purpose instruction dataset (Alpaca 52K) hurts domain performance.
Context-extension helps handle long financial texts.
Results
FinSent Micro-F1
FPB Micro-F1
FiQA Micro-F1
Accuracy
ECTSum ROUGE-1
Avg improvement (7B)
+138.4%
Avg improvement (65B)
+28.2%
Who Should Care
What To Try In 7 Days
Run InvestLM on a sample of your firm's SEC filings and compare summaries to analyst notes.
Fine-tune a 7B model with a few hundred curated domain instructions and measure micro-F1 on a key classification task.
Compare domain-only against mixed instruction tuning before adding a large generic instruction set; the paper found that mixing in generic instructions hurt domain performance.
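To score the classification experiments suggested above, micro-F1 can be computed by pooling true positives, false positives, and false negatives across all classes. The following is a minimal sketch, assuming single-label classification; the function name `micro_f1` and the example labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over all classes before precision and recall
    are combined, so frequent classes weigh more than rare ones.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    total_tp = sum(tp.values())
    denom = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    return 2 * total_tp / denom if denom else 0.0


# Made-up sentiment labels, purely for illustration.
gold = ["positive", "neutral", "negative", "neutral", "positive"]
domain_only = ["positive", "neutral", "negative", "neutral", "negative"]
mixed = ["neutral", "neutral", "negative", "positive", "positive"]

print("domain-only micro-F1:", micro_f1(gold, domain_only))
print("mixed micro-F1:", micro_f1(gold, mixed))
```

With exactly one predicted label per example, micro-F1 equals plain accuracy, so the two numbers can be cross-checked against each other.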
Agent Features
Memory
- long-context (8,192 tokens) via RoPE scaling
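Linear RoPE scaling (also called position interpolation) divides each position index by a fixed factor so that a longer sequence maps back into the position range seen during pretraining. The sketch below assumes LLaMA's original 2,048-token context, which gives a factor of 4 for the 8,192-token target. The function names and the adjacent-pair rotation layout are illustrative simplifications, not the authors' implementation.

```python
import math

ORIGINAL_CONTEXT = 2048  # assumption: LLaMA (v1) pretraining context length
TARGET_CONTEXT = 8192    # context length reported for InvestLM


def rope_angles(position, head_dim, scale_factor=1.0, base=10000.0):
    """Rotation angles for one position, with linear position scaling."""
    if head_dim % 2:
        raise ValueError("head_dim must be even")
    scaled_position = position / scale_factor
    return [scaled_position * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


def apply_rope(vec, position, scale_factor=1.0):
    """Rotate consecutive pairs of `vec` by the RoPE angles."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), scale_factor)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


factor = TARGET_CONTEXT / ORIGINAL_CONTEXT  # 4.0

# Position 8188 in the extended window gets the same angles as
# position 2047 in the original window.
assert rope_angles(8188, 8, factor) == rope_angles(2047, 8)
```

Because every scaled position stays below 2,048, the model only sees rotation angles inside its pretraining range. Neighbouring tokens end up closer together in position space, which is why this kind of extension is normally paired with some fine-tuning.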
Optimization Features
Model Optimization
- LoRA
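LoRA freezes the pretrained weight matrix W and learns a low-rank update, so the layer computes W x + (alpha / r) * B (A x) with only A and B trainable. The pure-Python sketch below illustrates the idea for a single linear layer. The class name, the `alpha` value, and the choice of which layers to adapt are assumptions made for illustration; the summary above reports only the rank (16).

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight       # frozen pretrained weight (d_out x d_in)
        self.scale = alpha / rank  # alpha is a hypothetical hyperparameter here
        # A starts small and random, B starts at zero, so the adapted layer
        # initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter."""
    return rank * (d_in + d_out)


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=2, alpha=4.0)
print(layer.forward([1.0, 1.0]))

# Assuming an 8192 x 8192 projection (LLaMA-65B's hidden size), rank 16:
print(lora_param_count(8192, 8192, 16), "trainable vs", 8192 * 8192, "frozen")
```

At rank 16, the adapter for one such projection holds 262,144 trainable parameters against roughly 67 million frozen ones. That ratio is what makes tuning a 65B model on a small instruction set practical.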
Training Optimization
- 15 epochs (65B), lr 3e-4, batch 16; 12 epochs (7B), lr 3e-3, batch 32
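As a rough sense of scale, the hyperparameters above imply only a few hundred to a little over a thousand optimizer steps. The sketch below works through that arithmetic. It assumes that all 1,335 instructions are used for training, that there is no gradient accumulation, and that the final partial batch is kept. None of these details is stated in the summary, and the `TrainConfig` class is a hypothetical helper.

```python
import math
from dataclasses import dataclass

NUM_EXAMPLES = 1335  # size of the curated instruction set


@dataclass(frozen=True)
class TrainConfig:
    name: str
    epochs: int
    learning_rate: float
    batch_size: int

    def steps_per_epoch(self, num_examples):
        # Assumption: the last, partially filled batch still counts as a step.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples):
        return self.epochs * self.steps_per_epoch(num_examples)


CONFIGS = [
    TrainConfig("65B", epochs=15, learning_rate=3e-4, batch_size=16),
    TrainConfig("7B", epochs=12, learning_rate=3e-3, batch_size=32),
]

for cfg in CONFIGS:
    print(cfg.name,
          cfg.steps_per_epoch(NUM_EXAMPLES), "steps/epoch,",
          cfg.total_steps(NUM_EXAMPLES), "steps total")
```

Under these assumptions the 65B run takes 84 steps per epoch (1,260 in total) and the 7B run takes 42 steps per epoch (504 in total).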
Reproducibility
License
- Adopts LLaMA license for released model parameters
Code URLs
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Expert evaluation is small-scale (six experts, 30 questions) and subjective.
- Some evaluation datasets are proprietary, limiting external verification.
- InvestLM is not a financial advisor; authors explicitly warn against using outputs as definitive investment advice.
- Performance is mixed: on some tasks (e.g., FLS) the model shows no gain or even degrades.
When Not To Use
- Do not rely on InvestLM for automated trading decisions without human oversight.
- Avoid using the model as a sole source of investment advice or legal compliance guidance.
- Avoid mixing large general-purpose instruction corpora when your goal is pure domain performance.
Failure Modes
- May still produce incorrect or risky investment suggestions; requires human validation.
- Performance can degrade if generic instructions are mixed into a small domain-tuning set.
- Evaluation bias: expert and GPT-4 preferences may not generalize across user populations or markets.
Core Entities
Models
- InvestLM (InvestLM-65B)
- InvestLM-7B
- LLaMA-65B
- LLaMA-7B
- GPT-3.5
- GPT-4
- Claude-2
- BloombergGPT
- FinMA
Metrics
- Micro-F1
- Accuracy
- ROUGE-1
- ROUGE-2
- ROUGE-L
- chrF++
Datasets
- Instruction dataset (1,335 examples)
- FinSent
- Financial PhraseBank (FPB)
- FOMC
- FiQA
- ESG
- FLS
- QA (proprietary)
- FinQA
- ECTSum
- Alpaca Instructions (52K)
Benchmarks
- Micro-F1
- Accuracy
- ROUGE-1
- ROUGE-2
- ROUGE-L
- chrF++

