A domain-tuned LLaMA-65B (InvestLM) for finance that boosts financial NLP and matches many commercial LLMs in expert judgment.

Overview

Decision Snapshot: Needs Validation

The paper shows clear benchmark gains and human preference evidence, but expert judgments are limited (6 experts, 30 items) and some evaluation datasets are proprietary.

Citations: 24

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

License: Adopts LLaMA license for released model parameters

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Yi Yang, Yixuan Tang, Kar Yan Tam

Why It Matters For Business

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

InvestLM is a financial-domain LLM built by instruction-tuning LLaMA-65B on a small, manually curated set of 1,335 finance-focused instructions (sources: CFA, SEC filings, textbooks, StackExchange, journals, etc.). The authors train with LoRA and extend the context window to 8,192 tokens. InvestLM outperforms the untuned LLaMA on 8 of 9 financial NLP tasks, with large average relative gains for the 7B model (+138.4%) and moderate gains for the 65B model (+28.2%). Six finance experts judged its answers comparable to or better than those of GPT-3.5 and GPT-4, while it trails Claude-2 in some comparisons. The model parameters are released under LLaMA terms.

Problem Statement

Closed commercial finance LLMs (e.g., BloombergGPT) block open research. Smaller public finance-tuned models generalize poorly. Can a small, high-quality instruction set turn a strong foundation model into a useful open financial assistant?

Main Contribution

Build InvestLM by instruction-tuning LLaMA-65B on a manually curated 1,335-example finance instruction set covering CFA, SEC filings, textbooks, journals, StackExchange, and crafted investment Q&A.

Use LoRA (rank=16) and Linear RoPE scaling to extend context to 8,192 tokens, enabling long-document finance tasks.
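Linear RoPE scaling can be illustrated with a minimal sketch. It assumes the base LLaMA context window of 2,048 tokens, so reaching 8,192 tokens implies a scale factor of 4. The function name `rope_angles` and its parameters are hypothetical and are not taken from the InvestLM code.

```python
ORIGINAL_CONTEXT = 2048   # assumed pretraining context length of the base LLaMA
TARGET_CONTEXT = 8192     # context length reported for InvestLM
SCALE = TARGET_CONTEXT / ORIGINAL_CONTEXT  # 4.0


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one token position (hypothetical helper).

    Linear scaling divides the position index by `scale`, so positions
    in the extended window map back into the range seen in pretraining.
    """
    scaled_position = position / scale
    return [
        scaled_position * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


# The last position of the extended window lands inside the original range.
last_scaled = (TARGET_CONTEXT - 1) / SCALE
print(SCALE, last_scaled)
```

With a scale of 4, position 8,188 in the extended window receives the same rotary angles as position 2,047 in the unscaled model, which is why the model only sees position values it already met during pretraining.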

Key Findings

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks: InvestLM > LLaMA-65B (Table 3); FinSent 0.71→0.79

Practical Use: If you have a strong base model, 1–2k high-quality domain instructions can meaningfully boost domain performance; try curated instruction tuning before collecting huge datasets.

Evidence Ref: Table 3

Smaller models benefit more from domain instruction tuning than larger models.

Numbers: LLaMA-7B → InvestLM-7B avg improvement +138.4%; LLaMA-65B → InvestLM-65B +28.2% (Table 4).

Practical Use: For resource-constrained teams, invest effort in domain instruction tuning for smaller models (7B) to get bigger relative gains.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FinSent Micro-F1	InvestLM 0.79	LLaMA-65B 0.71	+0.08	FinSent	Table 3: InvestLM vs LLaMA	Table 3
FPB Micro-F1	InvestLM 0.71	LLaMA-65B 0.38	+0.33	Financial PhraseBank (FPB)	Table 3: InvestLM vs LLaMA	Table 3
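The Delta column above is an absolute difference, while the averages quoted from Table 4 are percentages. As a rough sketch of how the two relate, the snippet below computes both for the two rows listed. It assumes the percentages are relative improvements over the baseline; the paper's exact averaging procedure is not reproduced here, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(baseline, tuned):
    """Relative improvement of `tuned` over `baseline`, in percent."""
    return (tuned - baseline) / baseline * 100.0


# (baseline, tuned) Micro-F1 pairs copied from the two rows above.
rows = {
    "FinSent": (0.71, 0.79),
    "FPB": (0.38, 0.71),
}

for task, (baseline, tuned) in rows.items():
    absolute = tuned - baseline
    print(f"{task}: {absolute:+.2f} absolute, "
          f"{relative_gain(baseline, tuned):+.1f}% relative")
```

A low baseline inflates the relative figure: the FPB row gains 0.33 absolute but roughly 87% relative, which is one reason relative averages look larger for weaker base models.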

What To Try In 7 Days

Run InvestLM on a sample of your firm's SEC filings and compare summaries to analyst notes.

Fine-tune a 7B model with a few hundred curated domain instructions and measure micro-F1 on a key classification task.

Avoid mixing large generic instruction sets; test domain-only vs. mixed instruction tuning and compare results.
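For the comparison suggested above, micro-F1 can be computed without extra dependencies. This is a minimal sketch for single-label classification; the label lists are made-up placeholders, and `micro_f1` is a hypothetical helper rather than the paper's evaluation code.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[g] += 1  # gold class g was missed
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denominator = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denominator if denominator else 0.0


# Hypothetical labels for a sentiment task; replace with real model outputs.
gold = ["positive", "negative", "neutral", "neutral", "positive"]
domain_only = ["positive", "negative", "neutral", "positive", "positive"]
mixed = ["positive", "neutral", "neutral", "positive", "negative"]

print("domain-only:", micro_f1(gold, domain_only))
print("mixed:", micro_f1(gold, mixed))
```

When every example has exactly one gold label and one predicted label, micro-F1 equals plain accuracy, so the two runs can also be sanity-checked by counting correct predictions.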

Agent Features

Memory

long-context (8,192 tokens) via RoPE scaling

Optimization Features

Model Optimization

LoRA
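To give a sense of why LoRA keeps tuning cheap, the sketch below counts trainable parameters for a single square projection matrix at the reported rank of 16. It assumes a hidden size of 8,192 for LLaMA-65B; which layers the authors actually adapted is not stated here, so the single-matrix setup and the helper `lora_trainable_params` are illustrative only.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 8192  # assumed LLaMA-65B hidden dimension
RANK = 16           # LoRA rank reported for InvestLM

full = HIDDEN_SIZE * HIDDEN_SIZE
adapter = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
print(f"full matrix: {full:,} params; rank-{RANK} adapter: {adapter:,} params "
      f"({adapter / full:.2%} of full)")
```

Under these assumptions the adapter trains 262,144 parameters for a matrix that holds 67,108,864, which is under half a percent of the full weight.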

Training Optimization

15 epochs (65B), lr 3e-4, batch 16; 12 epochs (7B), lr 3e-3, batch 32
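The settings above can be turned into an approximate count of optimizer steps. This is a back-of-the-envelope sketch: it assumes the listed batch size is the effective batch size, that the final partial batch is kept, and that all 1,335 instructions are used for training. `TrainConfig` is a hypothetical container, not part of any released code.

```python
import math
from dataclasses import dataclass

NUM_EXAMPLES = 1335  # size of the curated instruction set


@dataclass(frozen=True)
class TrainConfig:
    name: str
    epochs: int
    learning_rate: float
    batch_size: int

    def steps_per_epoch(self, num_examples=NUM_EXAMPLES):
        # Assumes the last partial batch is kept and no gradient accumulation.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples=NUM_EXAMPLES):
        return self.epochs * self.steps_per_epoch(num_examples)


# Values copied from the line above.
configs = [
    TrainConfig("InvestLM-65B", epochs=15, learning_rate=3e-4, batch_size=16),
    TrainConfig("InvestLM-7B", epochs=12, learning_rate=3e-3, batch_size=32),
]

for cfg in configs:
    print(cfg.name, cfg.steps_per_epoch(), cfg.total_steps())
```

Under these assumptions the 65B run takes 84 steps per epoch (1,260 in total) and the 7B run takes 42 steps per epoch (504 in total), which illustrates how short a training run on a dataset of this size is.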

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Adopts LLaMA license for released model parameters

Code URLs

https://github.com/AbaciNLP/InvestLM

Risks & Boundaries

Limitations

Expert evaluation is small-scale (six experts, 30 questions) and subjective.

Some evaluation datasets are proprietary, limiting external verification.

When Not To Use

Do not rely on InvestLM for automated trading decisions without human oversight.

Avoid using the model as a sole source of investment advice or legal compliance guidance.

Failure Modes

May still produce incorrect or risky investment suggestions; requires human validation.

Performance can degrade if generic instructions are mixed into a small domain-tuning set.

Core Entities

Models

InvestLM (InvestLM-65B), InvestLM-7B, LLaMA-65B, LLaMA-7B, GPT-3.5, GPT-4, Claude-2, BloombergGPT, FinMA

Metrics

Micro-F1, Acc, Rouge-1, Rouge-2, Rouge-L, CHRF++

Datasets

Instruction dataset (1,335 examples), FinSent, Financial PhraseBank (FPB), FOMC, FiQA, ESG, FLS, QA (proprietary), FinQA, ECTSum, Alpaca Instructions (52K)

Benchmarks

Micro-F1, Accuracy, ROUGE-1, ROUGE-2, ROUGE-L, CHRF++
