InvestLM is a finance-tuned LLaMA-65B that improves on financial NLP benchmarks and that finance experts rate as comparable to GPT-3.5 and GPT-4.

September 15, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

24

Authors

Yi Yang, Yixuan Tang, Kar Yan Tam

Why It Matters For Business

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Summary TLDR

InvestLM is a financial-domain LLM built by instruction-tuning LLaMA-65B on a small, manually curated set of 1,335 finance-focused instructions, drawn from sources including CFA, SEC filings, textbooks, StackExchange, and journals. The authors train with LoRA and extend the context window to 8,192 tokens. InvestLM outperforms the untuned LLaMA on 8 of 9 financial NLP tasks, with large average gains for the 7B model (+138.4%) and moderate gains for the 65B model (+28.2%). Six finance experts judged its answers comparable to or better than those of GPT-3.5 and GPT-4, although it trails Claude-2 in some comparisons. The model parameters are released under the LLaMA license terms.

Problem Statement

Closed commercial finance LLMs (e.g., BloombergGPT) block open research. Smaller public finance-tuned models generalize poorly. Can a small, high-quality instruction set turn a strong foundation model into a useful open financial assistant?

Main Contribution

Build InvestLM by instruction-tuning LLaMA-65B on a manually curated 1,335-example finance instruction set covering CFA, SEC filings, textbooks, journals, StackExchange, and crafted investment Q&A.

Use LoRA (rank=16) and Linear RoPE scaling to extend context to 8,192 tokens, enabling long-document finance tasks.

Show InvestLM improves financial NLP benchmarks vs. LLaMA and achieves expert-rated responses comparable to GPT-3.5/GPT-4; release model parameters under LLaMA license.

Key Findings

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks: InvestLM > LLaMA-65B (Table 3); FinSent 0.71 → 0.79.

Smaller models benefit more from domain instruction tuning than larger models.

Numbers: LLaMA-7B → InvestLM-7B average improvement +138.4%; LLaMA-65B → InvestLM-65B +28.2% (Table 4).

Expert judges rate InvestLM comparable to or better than GPT-3.5 and GPT-4 on investment Q&A.

Numbers: Preferences are reported from both the expert assessment and a GPT-4 evaluation; inter-annotator agreement is 72.5%.

Adding a large general-purpose instruction dataset (Alpaca 52K) hurts domain performance.

Numbers: FPB micro-F1 drops 0.74 → 0.42; multiple tasks degrade (Table 5).

Context-extension helps handle long financial texts.

Numbers: The authors apply Linear RoPE scaling ×4 to reach an 8,192-token context (Section 3).
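To make the context-extension step concrete, here is a minimal NumPy sketch of linear RoPE scaling (position interpolation). It is an illustration under stated assumptions, not the authors' code: the function name `rope_angles`, the head dimension of 128, and the base of 10000 are illustrative choices, and the 2,048-token original window is inferred from the ×4 factor and the 8,192-token target.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotation angles used by rotary position embeddings.

    With scale > 1, positions are divided by `scale` (linear scaling),
    so a longer sequence is squeezed into the position range seen in pretraining.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    scaled_positions = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(scaled_positions, inv_freq)  # shape: (seq_len, head_dim // 2)

ORIGINAL_CONTEXT = 2048  # assumption: 8,192 / 4
SCALE = 4                # scaling factor stated in the summary
HEAD_DIM = 128           # assumption: illustrative head size

original = rope_angles(np.arange(ORIGINAL_CONTEXT), HEAD_DIM)
extended = rope_angles(np.arange(ORIGINAL_CONTEXT * SCALE), HEAD_DIM, scale=SCALE)

# Every 4th position of the extended sequence lands exactly on an original position.
print(np.allclose(extended[::SCALE], original))
```

The design point is that the scaled model never sees a position index beyond the pretraining range; it sees fractional positions inside that range instead, which is why a short fine-tuning run is typically enough to adapt.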

Results

FinSent Micro-F1

Value: InvestLM 0.79

Baseline: LLaMA-65B 0.71

FPB Micro-F1

Value: InvestLM 0.71

Baseline: LLaMA-65B 0.38

FiQA Micro-F1

Value: InvestLM 0.90

Baseline: LLaMA-65B 0.75

Accuracy

Value: InvestLM 0.29

Baseline: LLaMA-65B 0.23

ECTSum Rouge-1

Value: InvestLM 0.26

Baseline: LLaMA-65B 0.14

Avg improvement (7B)

Value: InvestLM-7B avg +138.4%

Baseline: LLaMA-7B

Avg improvement (65B)

Value: InvestLM-65B avg +28.2%

Baseline: LLaMA-65B
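The per-task figures above can be turned into relative gains with a few lines of Python. This is a rough sketch that assumes "improvement" means relative change over the baseline, (value − baseline) / baseline; the helper name `relative_improvement` is hypothetical. Only the five tasks listed here are included, so the result is not expected to reproduce the reported averages, which cover the full task set in Table 4.

```python
def relative_improvement(baseline, value):
    """Relative change over the baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# (baseline LLaMA-65B, InvestLM) pairs copied from the Results section.
results = {
    "FinSent Micro-F1": (0.71, 0.79),
    "FPB Micro-F1": (0.38, 0.71),
    "FiQA Micro-F1": (0.75, 0.90),
    "Accuracy": (0.23, 0.29),
    "ECTSum Rouge-1": (0.14, 0.26),
}

gains = {task: relative_improvement(b, v) for task, (b, v) in results.items()}
for task, gain in gains.items():
    print(f"{task}: {gain:+.1f}%")
```

Reading the gains this way shows how uneven they are across tasks: sentiment on FinSent moves modestly, while FPB and ECTSum nearly double from a low baseline.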

Who Should Care

What To Try In 7 Days

Run InvestLM on a sample of your firm's SEC filings and compare summaries to analyst notes.

Fine-tune a 7B model with a few hundred curated domain instructions and measure micro-F1 on a key classification task.

Avoid mixing large generic instruction sets; test domain-only vs. mixed instruction tuning and compare results.
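For the classification experiment suggested above, micro-F1 can be computed without any dependencies. The sketch below pools true positives, false positives, and false negatives across all labels before computing F1; the function name `micro_f1` and the toy sentiment labels are made up for illustration and are not data from the paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1 once."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label and truth != label:
                fp += 1
            elif truth == label and pred != label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and model predictions for five sentences.
gold = ["positive", "negative", "neutral", "positive", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "neutral"]

score = micro_f1(gold, pred)
print(f"micro-F1: {score:.2f}")
```

For single-label tasks like this one, micro-F1 coincides with plain accuracy, so it is worth also tracking a per-class or macro score when the label distribution is skewed. Running the same function on the outputs of a domain-only model and a mixed-instruction model gives a like-for-like comparison.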

Agent Features

Memory

  • long-context (8,192 tokens) via RoPE scaling

Optimization Features

Model Optimization

  • LoRA

Training Optimization

  • 15 epochs (65B), lr 3e-4, batch 16; 12 epochs (7B), lr 3e-3, batch 32
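As background for the LoRA entry above, the following NumPy sketch shows the standard low-rank update on a single linear layer. It follows the general LoRA formulation rather than the authors' training code: rank 16 comes from the summary, while the layer size of 512, the scaling factor `alpha = 32`, and the function name `lora_forward` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN = D_OUT = 512   # assumption: illustrative layer size
RANK = 16            # rank reported in the summary
ALPHA = 32           # assumption: illustrative LoRA scaling factor

W = rng.normal(size=(D_OUT, D_IN))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(RANK, D_IN))  # trainable down-projection
B = np.zeros((D_OUT, RANK))                    # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, rank):
    """Frozen linear layer plus a scaled low-rank correction."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(4, D_IN))
y = lora_forward(x, W, A, B, ALPHA, RANK)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly, and only `A` and `B` receive gradient updates. That small trainable footprint is what makes instruction-tuning a 65B model on about 1,300 examples practical.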

Reproducibility

License

  • Adopts LLaMA license for released model parameters

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Expert evaluation is small-scale (six experts, 30 questions) and subjective.
  • Some evaluation datasets are proprietary, limiting external verification.
  • InvestLM is not a financial advisor; the authors explicitly warn against treating its outputs as definitive investment advice.
  • Performance is mixed: some tasks (e.g., FLS) show no gain or degradation.

When Not To Use

  • Do not rely on InvestLM for automated trading decisions without human oversight.
  • Avoid using the model as a sole source of investment advice or legal compliance guidance.
  • Avoid mixing large general-purpose instruction corpora when your goal is pure domain performance.

Failure Modes

  • May still produce incorrect or risky investment suggestions; requires human validation.
  • Performance can degrade if generic instructions are mixed into a small domain-tuning set.
  • Evaluation bias: experts and GPT-4 preferences may not generalize across user populations or markets.

Core Entities

Models

  • InvestLM (InvestLM-65B)
  • InvestLM-7B
  • LLaMA-65B
  • LLaMA-7B
  • GPT-3.5
  • GPT-4
  • Claude-2
  • BloombergGPT
  • FinMA

Metrics

  • Micro-F1
  • Acc
  • Rouge-1
  • Rouge-2
  • Rouge-L
  • CHRF++

Datasets

  • Instruction dataset (1,335 examples)
  • FinSent
  • Financial PhraseBank (FPB)
  • FOMC
  • FiQA
  • ESG
  • FLS
  • QA (proprietary)
  • FinQA
  • ECTSum
  • Alpaca Instructions (52K)

Benchmarks

  • Micro-F1
  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • CHRF++