AlphaFin dataset + Stock-Chain: a RAG-enabled LLM system for stock prediction and financial Q&A

Overview

Decision Snapshot: Needs Validation

The system shows strong backtest gains on the presented test split and clear user-preference wins, but the results come from a single Chinese-focused dataset that was partly built with synthetic (ChatGPT-generated) data; further validation is needed before live deployment.

Citations: 7

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, Wei Lin

Why It Matters For Business

Combining a domain-tuned LLM with retrieval of up-to-date reports and news can improve decision-support outputs and backtested portfolio returns compared to off-the-shelf models on this dataset.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

This paper releases AlphaFin, a multi-part financial dataset (reports, news, StockQA, research data) and presents Stock-Chain: a two-stage system that fine-tunes an LLM (StockGPT) with LoRA and augments it with a vector DB-based RAG pipeline for stock trend prediction and financial Q&A. On an out-of-sample AlphaFin test set, Stock-Chain reported higher annualized returns (30.8% ARR) and better human/GPT-4 preference scores than several baselines. The work focuses on Chinese financial sources, uses ChatGPT for data augmentation and summaries, and emphasizes reducing hallucinations via retrieval. Code and data are linked on the project GitHub.

Problem Statement

Current stock models either predict price movement from time-series data (ML/DL) without explanations or use LLMs that lack real-time facts and hallucinate. The field lacks high-quality financial training data and a practical pipeline that combines reasoning, real-time knowledge, and explainable predictions for investors.

Main Contribution

AlphaFin dataset suite combining research datasets, StockQA (prices + Q&A), financial news, financial reports, and 200 hand-written chain-of-thought (CoT) examples.

Stock-Chain system: two-stage pipeline (StockGPT fine-tuned on AlphaFin; RAG-powered vector DB retrieval for real-time knowledge) for stock trend prediction and conversational financial Q&A.

Key Findings

Stock-Chain achieved substantially higher backtested annualized return than baselines.

Numbers: ARR 30.8% for Stock-Chain vs 17.5% for FinGPT

Practical Use: Integrate retrieval and domain fine-tuning to materially improve medium-term backtested returns versus off-the-shelf FinLLMs on this test set.

Evidence Ref: Table 2

Fine-tuning with AlphaFin data raises LLM trading performance over vanilla models.

Numbers: ChatGLM2 ARR 8.1% → with raw_data 15.8% → Stock-Chain 30.8%

Practical Use: Add domain-specific reports and simple Q&A examples to LLM fine-tuning before deploying financial prediction models.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Annualized Rate of Return (ARR)	30.8%	FinGPT 17.5%	+13.3 pp	AlphaFin-Test (financial report subset)	Table 2 shows ARR values for models	Table 2
Accuracy	55.7%	XGBoost 55.9%	-0.2 pp	AlphaFin-Test	Table 2 accuracy column	Table 2

What To Try In 7 Days

Build a small vector DB of company reports and news; add semantic embeddings (e.g., BGE) and cosine retrieval.

Fine-tune an existing instruction-tuned LLM with a handful of report-based Q&A pairs and a few CoT examples using LoRA.

Run a simple monthly backtest: pick stocks the model predicts 'up' and weight by market cap to compare ARR against an index.
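The monthly backtest in the last step can be sketched as below. This is a minimal illustration, not the paper's evaluation code: the tickers, market caps, and returns are made-up toy values, the function names are hypothetical, and the annualization uses the common geometric convention, which may differ from the exact ARR definition used in the paper.

```python
from typing import Dict, List


def cap_weighted_return(
    predictions: Dict[str, str],
    market_caps: Dict[str, float],
    realized_returns: Dict[str, float],
) -> float:
    """One month's portfolio return: hold only stocks predicted 'up',
    weighted by market cap."""
    picks = [ticker for ticker, label in predictions.items() if label == "up"]
    total_cap = sum(market_caps[t] for t in picks)
    if not picks or total_cap <= 0:
        return 0.0  # assumption: stay in cash when nothing is selected
    return sum(market_caps[t] / total_cap * realized_returns[t] for t in picks)


def annualized_return(monthly_returns: List[float]) -> float:
    """Compound the monthly returns, then rescale the growth to 12 months
    (assumed geometric convention)."""
    if not monthly_returns:
        raise ValueError("need at least one monthly return")
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    return growth ** (12.0 / len(monthly_returns)) - 1.0


if __name__ == "__main__":
    # Toy month with hypothetical tickers and values.
    preds = {"AAA": "up", "BBB": "down", "CCC": "up"}
    caps = {"AAA": 300.0, "BBB": 500.0, "CCC": 100.0}
    rets = {"AAA": 0.04, "BBB": -0.02, "CCC": -0.01}
    month = cap_weighted_return(preds, caps, rets)
    print(f"monthly return: {month:.4f}")
    print(f"annualized if repeated: {annualized_return([month] * 12):.4f}")
```

Running the same two functions on index constituents (every stock labeled 'up') gives a cap-weighted benchmark to compare against.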

Agent Features

Memory

retrieval memory (vector DB, continuously updated)

Tool Use

vector DB retrieval, sentence embedding (BGE)

Frameworks

RAG, LoRA, RefGPT

Architectures

two-stage (predict + conversational) pipeline, RAG with vector DB plus LLM
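The retrieval step listed under Memory and Tool Use can be illustrated with a small cosine-similarity search. This is a sketch under stated assumptions: the three-dimensional vectors stand in for real sentence embeddings (in practice they would come from an embedding model such as BGE), the in-memory list stands in for a vector DB, and the names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Score every (text, vector) entry against the query and return the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy index: placeholder texts with hand-written stand-in embeddings.
    index = [
        ("Q3 earnings report", [0.9, 0.1, 0.0]),
        ("Sector news item", [0.1, 0.9, 0.0]),
        ("Unrelated note", [0.0, 0.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.2, 0.0], index, k=2):
        print(f"{score:.3f}  {text}")
```

The retrieved texts would then be placed in the LLM prompt as context; a production system would swap the linear scan for an approximate nearest-neighbor index.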

Optimization Features

Infra Optimization

single A800 80GB reported for training

Model Optimization

LoRA

Training Optimization

staged fine-tuning (reports then CoT examples), bf16 training
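For readers unfamiliar with LoRA, the standard formulation keeps the pretrained weight matrix frozen and trains only a low-rank update. The rank r and scaling factor α below are generic symbols; this summary does not state the values used for StockGPT.

```latex
h = W_0 x + \Delta W\, x = W_0 x + \frac{\alpha}{r}\, B A\, x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only A and B receive gradients, so the number of trainable parameters per adapted matrix drops from d·k to r·(d + k), which is consistent with training on a single GPU as reported above.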

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AlphaFin-proj/AlphaFin

Data URLs

https://github.com/AlphaFin-proj/AlphaFin

Risks & Boundaries

Limitations

Data and evaluation focus on Chinese markets and Chinese text sources, limiting geographic generality.

Some training data (StockQA, summaries) were generated or augmented with ChatGPT, which can introduce bias or leakage.

When Not To Use

As a sole automated trading engine without rigorous live testing and risk controls.

For high-frequency or intraday trading, since the method is monthly and uses reports/news.

Failure Modes

Hallucinations when relevant documents are missing or retrieval fails.

Outdated knowledge if vector DB is not continuously updated.
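One possible guard against both failure modes is to abstain when no retrieved document is both relevant enough and recent enough. The sketch below is an illustrative mitigation, not something described in the paper: the function names, the 0.5 similarity threshold, and the 45-day freshness window are all hypothetical values that would need tuning.

```python
from datetime import date, timedelta
from typing import List, Tuple

Hit = Tuple[float, date, str]  # (similarity score, publication date, text)


def usable_context(
    hits: List[Hit],
    today: date,
    min_score: float = 0.5,   # hypothetical relevance threshold
    max_age_days: int = 45,   # hypothetical freshness window
) -> List[str]:
    """Keep only hits that are similar enough and published recently enough."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        text
        for score, published, text in hits
        if score >= min_score and published >= cutoff
    ]


def answer_or_abstain(hits: List[Hit], today: date) -> str:
    """Return a context string for the LLM, or an abstention marker when
    nothing passes the relevance and freshness filters."""
    context = usable_context(hits, today)
    if not context:
        return "ABSTAIN: no sufficiently relevant, recent documents retrieved."
    return "ANSWER_WITH_CONTEXT: " + " | ".join(context)


if __name__ == "__main__":
    today = date(2024, 3, 1)
    hits = [
        (0.82, date(2024, 2, 20), "Recent quarterly report"),
        (0.91, date(2023, 6, 1), "Old but similar report"),
        (0.20, date(2024, 2, 28), "Recent but off-topic news"),
    ]
    print(answer_or_abstain(hits, today))
    print(answer_or_abstain(hits[1:], today))
```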

Core Entities

Models

Stock-Chain, StockGPT, FinGPT, FinMA, ChatGPT, ChatGLM2, LSTM, GRU, XGBoost, Random Forest

Metrics

ARR, ACC, AERR, ANVOL, Sharpe Ratio, Maximum Drawdown, Calmar Ratio, MDD, ROUGE-1, ROUGE-2, ROUGE-L

Datasets

AlphaFin, AlphaFin-Test, FPB, FinQA, ConvFinQA, Headline, StockQA, Financial News, Financial Reports, DataYes, Tushare, AKshare

Benchmarks

AlphaFin-Test
