A domain-tuned LLaMA-65B (InvestLM) for finance that boosts financial NLP and matches many commercial LLMs in expert judgment.

September 15, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper shows clear benchmark gains and human preference evidence, but expert judgments are limited (6 experts, 30 items) and some evaluation datasets are proprietary.

Citations: 24

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

License: Adopts LLaMA license for released model parameters

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Yi Yang, Yixuan Tang, Kar Yan Tam

Links

Abstract / PDF / Code

Why It Matters For Business

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Summary TLDR

InvestLM is a financial-domain LLM built by instruction-tuning LLaMA-65B on a small, manually curated set of 1,335 finance-focused instructions (sources: CFA, SEC filings, textbooks, StackExchange, journals, etc.). Using LoRA and context-extension to 8,192 tokens, the authors show InvestLM improves performance on 8/9 financial NLP tasks vs. the untuned LLaMA, yields large gains for 7B models (avg +138.4%) and moderate gains for 65B (avg +28.2%), and is judged by six finance experts as comparable to or better than GPT-3.5/GPT-4 while trailing Claude-2 in some comparisons. The model parameters are released under LLaMA terms.

Problem Statement

Closed commercial finance LLMs (e.g., BloombergGPT) block open research. Smaller public finance-tuned models generalize poorly. Can a small, high-quality instruction set turn a strong foundation model into a useful open financial assistant?

Main Contribution

Build InvestLM by instruction-tuning LLaMA-65B on a manually curated 1,335-example finance instruction set covering CFA, SEC filings, textbooks, journals, StackExchange, and crafted investment Q&A.

Use LoRA (rank=16) and Linear RoPE scaling to extend context to 8,192 tokens, enabling long-document finance tasks.
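
To make the context-extension step concrete, the sketch below illustrates the general idea behind linear RoPE scaling: position indices are divided by a scale factor so that positions in an 8,192-token sequence fall inside the range the base model saw during pre-training. The 2,048-token base window, the head dimension, the rotary base, and the function name `rope_angles` are illustrative assumptions, not details taken from this summary.

```python
def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    """Rotary angles for one token position.

    A scale greater than 1 applies linear RoPE scaling: the position
    index is divided by `scale` before the angles are computed.
    """
    scaled_pos = position / scale
    return [
        scaled_pos / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]

# Assumed values for illustration (not stated in the summary).
BASE_CONTEXT = 2048      # assumed pre-training window of the base model
TARGET_CONTEXT = 8192    # extended window reported for InvestLM
SCALE = TARGET_CONTEXT / BASE_CONTEXT

# After scaling, the last extended position maps below the base window length.
last_scaled = rope_angles(TARGET_CONTEXT - 1, scale=SCALE)
print(f"scale={SCALE}, last scaled position={last_scaled[0]}")
```

The design trade-off is that neighbouring tokens end up closer together in position space, which is one reason a scaled model is usually fine-tuned on long inputs rather than used as-is.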

Key Findings

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks, InvestLM > LLaMA-65B (Table 3); FinSent 0.71 → 0.79

Practical Use: If you have a strong base model, 1–2k high-quality domain instructions can meaningfully boost domain performance; try curated instruction tuning before collecting huge datasets.

Evidence Ref: Table 3

Smaller models benefit more from domain instruction tuning than larger models.

Numbers: LLaMA-7B → InvestLM-7B avg improvement +138.4%; LLaMA-65B → InvestLM-65B +28.2% (Table 4).

Practical Use: For resource-constrained teams, invest effort in domain instruction tuning for smaller models (7B) to get bigger relative gains.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
FinSent Micro-F1 | InvestLM 0.79 | LLaMA-65B 0.71 | +0.08 | FinSent | Table 3: InvestLM vs LLaMA | Table 3
FPB Micro-F1 | InvestLM 0.71 | LLaMA-65B 0.38 | +0.33 | Financial PhraseBank (FPB) | Table 3: InvestLM vs LLaMA | Table 3
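
The Delta column reports absolute differences, while the Table 4 averages quoted under Key Findings are percentages, so it can help to compute both views side by side. The sketch below does this for the two rows shown here. The relative-gain formula, (tuned - baseline) / baseline, is the usual definition and is assumed here rather than confirmed as the paper's exact averaging method; the function name `gains` is hypothetical.

```python
# Micro-F1 scores copied from the two Results rows above:
# (dataset, LLaMA-65B baseline, InvestLM)
rows = [
    ("FinSent", 0.71, 0.79),
    ("FPB", 0.38, 0.71),
]

def gains(baseline, tuned):
    """Return (absolute delta, relative gain in percent), both rounded."""
    delta = tuned - baseline
    return round(delta, 2), round(100 * delta / baseline, 1)

for name, baseline, tuned in rows:
    delta, rel = gains(baseline, tuned)
    print(f"{name}: {delta:+.2f} absolute, {rel:+.1f}% relative")
```

A low baseline inflates the relative figure, which is worth keeping in mind when comparing percentage gains across model sizes.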

What To Try In 7 Days

Run InvestLM on a sample of your firm's SEC filings and compare summaries to analyst notes.

Fine-tune a 7B model with a few hundred curated domain instructions and measure micro-F1 on a key classification task.

Avoid mixing large generic instruction sets into the domain set by default; run domain-only and mixed instruction tuning side by side and compare results before committing to either.
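
For the classification experiment suggested above, a minimal micro-F1 sketch might look like the following. It pools true positives, false positives, and false negatives over all classes; when every example receives exactly one predicted label, this equals plain accuracy. The label values and prediction lists are made-up placeholders, not data from the paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label classification."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp += 1
        else:
            fp += 1  # the predicted class gains a false positive
            fn += 1  # the true class gains a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentiment labels for a before/after comparison.
gold = ["pos", "neg", "neu", "pos", "neg"]
base_preds = ["pos", "neu", "neu", "neg", "neg"]
tuned_preds = ["pos", "neg", "neu", "pos", "neu"]

print("base:", micro_f1(gold, base_preds))
print("tuned:", micro_f1(gold, tuned_preds))
```

Running the same function on predictions from the domain-only and mixed-instruction variants gives a like-for-like comparison on a held-out set.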

Agent Features

Memory
long-context (8,192 tokens) via RoPE scaling

Optimization Features

Model Optimization
LoRA
Training Optimization
15 epochs (65B), lr 3e-4, batch 16; 12 epochs (7B), lr 3e-3, batch 32
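
As a rough illustration of why LoRA keeps tuning affordable, the sketch below records the hyperparameters listed above and counts the trainable weights LoRA adds to a single weight matrix. The class and function names are hypothetical, the hidden size of 8,192 is an assumption not stated in this summary, and the summary does not say which layers receive LoRA adapters, so the numbers cover one square projection only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TuningConfig:
    """Hyperparameters as listed in this summary."""
    epochs: int
    learning_rate: float
    batch_size: int
    # The summary reports a single rank (16); using it for both
    # model sizes is an assumption.
    lora_rank: int = 16

CONFIGS = {
    "65B": TuningConfig(epochs=15, learning_rate=3e-4, batch_size=16),
    "7B": TuningConfig(epochs=12, learning_rate=3e-3, batch_size=32),
}

def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank),
    while the original matrix stays frozen.
    """
    return rank * (d_in + d_out)

# Assumed hidden size for illustration; not stated in the summary.
HIDDEN = 8192
full = HIDDEN * HIDDEN
added = lora_params(HIDDEN, HIDDEN, CONFIGS["65B"].lora_rank)
print(f"LoRA adds {added:,} trainable weights vs {full:,} frozen ({added / full:.2%})")
```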

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Adopts LLaMA license for released model parameters

Risks & Boundaries

Limitations

Expert evaluation is small-scale (six experts, 30 questions) and subjective.

Some evaluation datasets are proprietary, limiting external verification.

When Not To Use

Do not rely on InvestLM for automated trading decisions without human oversight.

Avoid using the model as a sole source of investment advice or legal compliance guidance.

Failure Modes

May still produce incorrect or risky investment suggestions; requires human validation.

Performance can degrade if generic instructions are mixed into a small domain-tuning set.

Core Entities

Models

InvestLM (InvestLM-65B), InvestLM-7B, LLaMA-65B, LLaMA-7B, GPT-3.5, GPT-4, Claude-2, BloombergGPT, FinMA

Metrics

Micro-F1, Acc, Rouge-1, Rouge-2, Rouge-L, CHRF++

Datasets

Instruction dataset (1,335 examples), FinSent, Financial PhraseBank (FPB), FOMC, FiQA, ESG, FLS, QA (proprietary), FinQA, ECTSum, Alpaca Instructions (52K)

Benchmarks

Micro-F1, Accuracy, ROUGE-1, ROUGE-2, ROUGE-L, CHRF++