AlphaFin dataset + Stock-Chain: a RAG-enabled LLM system for stock prediction and financial Q&A

March 19, 2024 · 7 min read

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

7

Authors

Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, Wei Lin

Why It Matters For Business

Combining a domain-tuned LLM with retrieval of up-to-date reports and news can improve decision-support outputs and backtested portfolio returns compared to off-the-shelf models on this dataset.

Summary TLDR

This paper releases AlphaFin, a multi-part financial dataset (reports, news, StockQA, research data) and presents Stock-Chain: a two-stage system that fine-tunes an LLM (StockGPT) with LoRA and augments it with a vector DB-based RAG pipeline for stock trend prediction and financial Q&A. On an out-of-sample AlphaFin test set, Stock-Chain reported higher annualized returns (30.8% ARR) and better human/GPT-4 preference scores than several baselines. The work focuses on Chinese financial sources, uses ChatGPT for data augmentation and summaries, and emphasizes reducing hallucinations via retrieval. Code and data are linked on the project GitHub.

Problem Statement

Current stock models either predict price movement from time-series data (ML/DL) without explanations or use LLMs that lack real-time facts and hallucinate. The field lacks high-quality financial training data and a practical pipeline that combines reasoning, real-time knowledge, and explainable predictions for investors.

Main Contribution

AlphaFin dataset suite combining research datasets, StockQA (prices + Q&A), financial news, financial reports, and 200 hand-written chain-of-thought (CoT) examples.

Stock-Chain system: two-stage pipeline (StockGPT fine-tuned on AlphaFin; RAG-powered vector DB retrieval for real-time knowledge) for stock trend prediction and conversational financial Q&A.

Training recipe: staged fine-tuning with LoRA and CoT data to improve analysis and reduce invalid outputs.

Extensive evaluation: backtested trading ARR, classification accuracy, ablations, human and GPT-4 preference studies, and case studies.

Key Findings

Stock-Chain achieved substantially higher backtested annualized return than baselines.

Numbers: ARR 30.8% for Stock-Chain vs 17.5% for FinGPT

Fine-tuning with AlphaFin data raises LLM trading performance over vanilla models.

Numbers: ChatGLM2 ARR 8.1% → 15.8% with raw data → 30.8% as Stock-Chain

Human and GPT-4 judges prefer Stock-Chain outputs over other LLMs.

Numbers: Human win rate >60% vs ChatGLM2; 62% vs FinGPT. GPT-4: 58% vs ChatGPT, 73% vs ChatGLM2

Invalid or unusable answers fell but remain significant.

Numbers: Invalid answer ratio reduced to 25.9% for Stock-Chain
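
An invalid-answer ratio of this kind can be measured by mapping each raw model response to a label and counting the responses that yield no single clear direction. The sketch below is only an assumption about how such a check might look, not the paper's parser: the keyword lists and the `classify_answer` and `invalid_ratio` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists; the paper's actual parsing rules are not given here.
UP_WORDS = ("up", "rise", "increase")
DOWN_WORDS = ("down", "fall", "decrease")


def classify_answer(text: str) -> str:
    """Map a free-text model answer to 'up', 'down', or 'invalid'."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    says_up = any(w in UP_WORDS for w in words)
    says_down = any(w in DOWN_WORDS for w in words)
    if says_up == says_down:  # neither direction, or both at once
        return "invalid"
    return "up" if says_up else "down"


def invalid_ratio(answers: list[str]) -> float:
    """Fraction of answers that do not yield a single clear direction."""
    if not answers:
        return 0.0
    counts = Counter(classify_answer(a) for a in answers)
    return counts["invalid"] / len(answers)
```

Counting hedged answers such as "it may rise or fall" as invalid is a deliberate choice here: an answer that names both directions cannot drive a trade.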

Results

Annualized Rate of Return (ARR)

Value: 30.8%

Baseline: FinGPT 17.5%

Accuracy

Value: 55.7%

Baseline: XGBoost 55.9%

ARR after staged fine-tuning

Value: Stock-Chain 30.8% (best)

Baseline: ChatGLM2 8.1%

ROUGE-1 (generation quality)

Value: 0.4352

Baseline: ChatGLM2 0.2794
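
For readers unfamiliar with ROUGE-1, the metric scores unigram overlap between a generated answer and a reference. The following is a minimal sketch under stated assumptions: it uses whitespace tokenization and reports F1, whereas the paper's tokenizer (which must handle Chinese text) and its exact ROUGE variant are not specified here. The `rouge1_f1` name is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, "the stock may rise" against the reference "the stock will rise soon" shares three unigrams, giving precision 3/4, recall 3/5, and an F1 of 2/3.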

What To Try In 7 Days

Build a small vector DB of company reports and news; add semantic embeddings (e.g., BGE) and cosine retrieval.

Fine-tune an existing instruction-tuned LLM with a handful of report-based Q&A pairs and a few CoT examples using LoRA.

Run a simple monthly backtest: pick stocks the model predicts 'up' and weight by market cap to compare ARR against an index.
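
The monthly backtest in the last step can be sketched as follows. This is an illustration under assumptions, not the paper's backtest code: the dict keys (`pred`, `market_cap`, `ret`) are hypothetical, and the annualization uses the standard compounding formula, which may differ in detail from how the paper computes ARR.

```python
import math


def monthly_portfolio_return(stocks: list[dict]) -> float:
    """Cap-weighted return of the stocks predicted 'up' in one month.

    Each dict has hypothetical keys: 'pred', 'market_cap', and 'ret'
    (the realized return over the following month).
    """
    picks = [s for s in stocks if s["pred"] == "up"]
    total_cap = sum(s["market_cap"] for s in picks)
    if total_cap == 0:  # nothing selected: stay in cash for the month
        return 0.0
    return sum(s["market_cap"] / total_cap * s["ret"] for s in picks)


def annualized_return(monthly_returns: list[float]) -> float:
    """Compound monthly returns and scale the result to a 12-month rate."""
    if not monthly_returns:
        return 0.0
    growth = math.prod(1.0 + r for r in monthly_returns)
    return growth ** (12 / len(monthly_returns)) - 1.0
```

Running the same two functions on an index's constituents (with every stock marked 'up') gives a like-for-like benchmark figure to compare against.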

Agent Features

Memory

  • retrieval memory (vector DB, continuously updated)

Tool Use

  • vector DB retrieval
  • sentence embedding (BGE)
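
As a rough illustration of the two tools listed above, the sketch below ranks documents by cosine similarity to a query vector. It is a toy under stated assumptions: the three-dimensional vectors are hand-made placeholders for real sentence embeddings (which would come from a model such as BGE), and the in-memory `index` dict and `top_k` helper are hypothetical stand-ins for a vector DB.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k (score, doc_id) pairs most similar to the query.

    `index` is a hypothetical in-memory store: {doc_id: embedding}.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
index = {
    "report_q1": [0.9, 0.1, 0.0],
    "news_merger": [0.1, 0.9, 0.1],
    "news_sports": [0.0, 0.1, 0.9],
}
hits = top_k([0.8, 0.2, 0.0], index, k=2)
```

Here `hits` ranks `report_q1` first and `news_merger` second. A linear scan like this is fine for a few thousand documents; larger collections usually call for an approximate nearest-neighbor index.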

Frameworks

  • RAG
  • LoRA
  • RefGPT

Architectures

  • two-stage (predict + conversational) pipeline
  • RAG with vector DB plus LLM

Optimization Features

Infra Optimization

  • single A800 80GB reported for training

Model Optimization

  • LoRA
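
To show why LoRA fits on a single GPU, the sketch below compares trainable-parameter counts for one weight matrix under full fine-tuning and under LoRA. The layer shape and rank are illustrative assumptions only; the rank and target layers used in the paper are not stated here, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries of W. LoRA freezes W
    and trains B (d_out x rank) and A (rank x d_in), so the effective
    weight is W + (alpha / rank) * B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative sizes only; not taken from the paper.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full={full:,} lora={lora:,} ratio={lora / full:.4%}")
# prints: full=16,777,216 lora=65,536 ratio=0.3906%
```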

Training Optimization

  • staged fine-tuning (reports then CoT examples)
  • bf16 training

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Data and evaluation focus on Chinese markets and Chinese text sources, limiting geographic generality.
  • Some training data (StockQA, summaries) were generated or augmented with ChatGPT, which can introduce bias or leakage.
  • Backtest results can overstate real-world performance; ACC is modest (~56%) and invalid output rate is non-negligible.

When Not To Use

  • As a sole automated trading engine without rigorous live testing and risk controls.
  • For high-frequency or intraday trading, since the method is monthly and uses reports/news.
  • For markets or languages not covered by AlphaFin without re-collection and re-tuning.

Failure Modes

  • Hallucinations when relevant documents are missing or retrieval fails.
  • Outdated knowledge if vector DB is not continuously updated.
  • High proportion of invalid answers (≈25.9%) on some queries.
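
The first two failure modes can be partly mitigated by refusing to answer when retrieval returns nothing relevant or recent. The following is a sketch under assumptions rather than anything described in the paper: the hit fields (`score`, `published`, `text`), both thresholds, and the helper names are hypothetical.

```python
from datetime import date, timedelta


def usable_context(hits, today, min_score=0.5, max_age_days=30):
    """Filter retrieved hits; an empty result means the caller should abstain.

    `hits` is a list of dicts with hypothetical keys 'score' (similarity),
    'published' (datetime.date), and 'text'. Thresholds are placeholders.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [
        h for h in hits
        if h["score"] >= min_score and h["published"] >= cutoff
    ]


def answer_or_abstain(question, hits, today):
    """Build a grounded prompt, or return None when no trustworthy context exists."""
    context = usable_context(hits, today)
    if not context:
        return None  # caller can reply "insufficient up-to-date information"
    sources = "\n".join(h["text"] for h in context)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"
```

Returning `None` instead of calling the model makes the abstention explicit, so missing or stale documents surface as a visible gap rather than a confident but unsupported answer.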

Core Entities

Models

  • Stock-Chain
  • StockGPT
  • FinGPT
  • FinMA
  • ChatGPT
  • ChatGLM2
  • LSTM
  • GRU
  • XGBoost
  • Random Forest

Metrics

  • ARR
  • ACC
  • AERR
  • ANVOL
  • Sharpe Ratio
  • Maximum Drawdown
  • Calmar Ratio
  • MDD
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L

Datasets

  • AlphaFin
  • AlphaFin-Test
  • FPB
  • FinQA
  • ConvFinQA
  • Headline
  • StockQA
  • Financial News
  • Financial Reports
  • DataYes
  • Tushare
  • AKshare

Benchmarks

  • AlphaFin-Test