Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
7
Why It Matters For Business
Combining a domain-tuned LLM with retrieval of up-to-date reports and news can improve decision-support outputs and backtested portfolio returns compared to off-the-shelf models on this dataset.
Summary TLDR
This paper releases AlphaFin, a multi-part financial dataset (reports, news, StockQA, research data) and presents Stock-Chain: a two-stage system that fine-tunes an LLM (StockGPT) with LoRA and augments it with a vector DB-based RAG pipeline for stock trend prediction and financial Q&A. On an out-of-sample AlphaFin test set, Stock-Chain reported higher annualized returns (30.8% ARR) and better human/GPT-4 preference scores than several baselines. The work focuses on Chinese financial sources, uses ChatGPT for data augmentation and summaries, and emphasizes reducing hallucinations via retrieval. Code and data are linked on the project GitHub.
Problem Statement
Current stock models either predict price movement from time-series data (ML/DL) without explanations or use LLMs that lack real-time facts and hallucinate. The field lacks high-quality financial training data and a practical pipeline that combines reasoning, real-time knowledge, and explainable predictions for investors.
Main Contribution
AlphaFin dataset suite combining research datasets, StockQA (prices + Q&A), financial news, financial reports, and 200 hand-written chain-of-thought (CoT) examples.
Stock-Chain system: two-stage pipeline (StockGPT fine-tuned on AlphaFin; RAG-powered vector DB retrieval for real-time knowledge) for stock trend prediction and conversational financial Q&A.
Training recipe: staged fine-tuning with LoRA and CoT data to improve analysis and reduce invalid outputs.
Extensive evaluation: backtested trading ARR, classification accuracy, ablations, human and GPT-4 preference studies, and case studies.
Key Findings
Stock-Chain achieved a higher backtested annualized return (30.8% ARR on the out-of-sample AlphaFin test set) than the baselines.
Fine-tuning with AlphaFin data raises LLM trading performance over vanilla models.
Human and GPT-4 judges prefer Stock-Chain outputs over other LLMs.
The share of invalid or unusable answers fell but remains significant (≈25.9% on some queries).
Results
Annualized Rate of Return (ARR)
Accuracy
ARR after staged fine-tuning
ROUGE-1 (generation quality)
Who Should Care
What To Try In 7 Days
Build a small vector DB of company reports and news; add semantic embeddings (e.g., BGE) and cosine retrieval.
Fine-tune an existing instruction-tuned LLM with a handful of report-based Q&A pairs and a few CoT examples using LoRA.
Run a simple monthly backtest: pick stocks the model predicts 'up' and weight by market cap to compare ARR against an index.
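The backtest step above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the tickers, market caps, and returns are made-up toy values, the function names are hypothetical, and ARR is assumed here to mean geometrically compounded monthly returns rescaled to 12 months, which may differ from the definition used in the paper.

```python
from typing import Dict, List


def cap_weighted_return(
    predictions: Dict[str, str],
    market_caps: Dict[str, float],
    next_month_returns: Dict[str, float],
) -> float:
    """One month's return from holding every 'up' pick, weighted by market cap."""
    picks = [ticker for ticker, label in predictions.items() if label == "up"]
    total_cap = sum(market_caps[t] for t in picks)
    if total_cap == 0:
        return 0.0  # no picks this month: treat the portfolio as held in cash
    return sum(market_caps[t] / total_cap * next_month_returns[t] for t in picks)


def annualized_return(monthly_returns: List[float]) -> float:
    """Compound the monthly returns, then rescale the growth to 12 months."""
    if not monthly_returns:
        raise ValueError("need at least one monthly return")
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    return growth ** (12.0 / len(monthly_returns)) - 1.0


# Hypothetical toy data: two months, three made-up tickers.
months = [
    {
        "predictions": {"AAA": "up", "BBB": "down", "CCC": "up"},
        "caps": {"AAA": 300.0, "BBB": 100.0, "CCC": 100.0},
        "returns": {"AAA": 0.02, "BBB": -0.05, "CCC": 0.06},
    },
    {
        "predictions": {"AAA": "down", "BBB": "up", "CCC": "up"},
        "caps": {"AAA": 310.0, "BBB": 95.0, "CCC": 105.0},
        "returns": {"AAA": 0.01, "BBB": 0.03, "CCC": -0.01},
    },
]
index_monthly = [0.01, 0.005]  # hypothetical index returns for the same months

strategy_monthly = [
    cap_weighted_return(m["predictions"], m["caps"], m["returns"]) for m in months
]
print("strategy ARR:", round(annualized_return(strategy_monthly), 4))
print("index ARR:   ", round(annualized_return(index_monthly), 4))
```

A real comparison would need many more months, transaction costs, and predictions made strictly before the returns they are scored against.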
Agent Features
Memory
- retrieval memory (vector DB, continuously updated)
Tool Use
- vector DB retrieval
- sentence embedding (BGE)
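As a rough illustration of the retrieval tool listed above, the sketch below ranks stored documents by cosine similarity to a query vector. It is an assumption-laden toy: `TinyVectorStore` is a hypothetical in-memory stand-in for a real vector DB, and the three-dimensional embeddings are hand-written placeholders for what a sentence-embedding model such as BGE would produce.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: a list of (text, embedding) pairs."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, embedding: Sequence[float]) -> None:
        self._items.append((text, list(embedding)))

    def search(self, query_embedding: Sequence[float], k: int = 2) -> List[Tuple[float, str]]:
        """Return the k stored texts most similar to the query, best first."""
        scored = [(cosine(query_embedding, emb), text) for text, emb in self._items]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Placeholder embeddings; a real pipeline would compute these with an embedding model.
store = TinyVectorStore()
store.add("Q3 report: revenue grew year over year", [0.9, 0.1, 0.0])
store.add("News: regulator opens inquiry", [0.1, 0.9, 0.2])
store.add("News: new product line announced", [0.6, 0.2, 0.7])

for score, text in store.search([1.0, 0.0, 0.1], k=2):
    print(f"{score:.3f}  {text}")
```

The retrieved texts would then be placed in the LLM prompt as context; a production system would replace the linear scan with an indexed vector database.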
Frameworks
- RAG
- LoRA
- RefGPT
Architectures
- two-stage (predict + conversational) pipeline
- RAG with vector DB plus LLM
Optimization Features
Infra Optimization
- single A800 80GB reported for training
Model Optimization
- LoRA
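For readers unfamiliar with LoRA, the toy sketch below shows the core idea: the pretrained weight matrix W stays frozen and only a low-rank pair (A, B) is trained, so the effective weight is W + (alpha / r) * B A. The matrices, sizes, and function names here are illustrative assumptions in pure Python, not the paper's training setup or any library's API.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 alpha: float, r: int) -> List[float]:
    """Frozen base output W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A and B, versus d_in * d_out for full fine-tuning."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weights (toy 2 x 2)
A = [[1.0, 1.0]]               # rank r = 1, shape 1 x 2
B_zero = [[0.0], [0.0]]        # B starts at zero, so the update starts at zero
x = [1.0, 2.0]

print(lora_forward(W, A, B_zero, x, alpha=2.0, r=1))  # same as the base model
print(lora_trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

Because B is initialized to zero, the adapted model starts out identical to the base model, and the trainable parameter count grows with the rank r rather than with the full matrix size.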
Training Optimization
- staged fine-tuning (reports then CoT examples)
- bf16 training
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Data and evaluation focus on Chinese markets and Chinese text sources, limiting geographic generality.
- Some training data (StockQA, summaries) were generated or augmented with ChatGPT, which can introduce bias or leakage.
- Backtest results can overstate real-world performance; ACC is modest (~56%) and invalid output rate is non-negligible.
When Not To Use
- As a sole automated trading engine without rigorous live testing and risk controls.
- For high-frequency or intraday trading, since the method is monthly and uses reports/news.
- For markets or languages not covered by AlphaFin without re-collection and re-tuning.
Failure Modes
- Hallucinations when relevant documents are missing or retrieval fails.
- Outdated knowledge if vector DB is not continuously updated.
- High proportion of invalid answers (≈25.9%) on some queries.
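One way to monitor the invalid-answer failure mode is to parse each model answer into a trend label and count the answers that cannot be parsed. The sketch below is a hypothetical check, not the paper's procedure: the English keyword lists are assumptions (the source data is Chinese), and an answer that mentions both directions or neither is treated as invalid.

```python
import re
from typing import List, Optional

# Assumed keyword lists; a real system would match its own answer template.
_UP = re.compile(r"\b(up|rise|rises|rising|increase)\b", re.IGNORECASE)
_DOWN = re.compile(r"\b(down|fall|falls|falling|decrease)\b", re.IGNORECASE)


def parse_trend(answer: str) -> Optional[str]:
    """Return 'up' or 'down', or None when the answer is missing or ambiguous."""
    has_up = bool(_UP.search(answer))
    has_down = bool(_DOWN.search(answer))
    if has_up == has_down:
        return None  # neither direction, or both: unusable for trading
    return "up" if has_up else "down"


def invalid_rate(answers: List[str]) -> float:
    """Fraction of answers that could not be parsed into a single trend."""
    if not answers:
        return 0.0
    return sum(parse_trend(a) is None for a in answers) / len(answers)


sample_answers = [
    "Based on the report, the stock is likely to go up next month.",
    "Earnings missed estimates, so the price should fall.",
    "I cannot determine the trend from the given information.",
    "The price may rise further after the announcement.",
]
print([parse_trend(a) for a in sample_answers])
print("invalid rate:", invalid_rate(sample_answers))
```

Tracking this rate per month makes it easier to see whether retrieval gaps or prompt changes are pushing the model toward unusable outputs.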
Core Entities
Models
- Stock-Chain
- StockGPT
- FinGPT
- FinMA
- ChatGPT
- ChatGLM2
- LSTM
- GRU
- XGBoost
- Random Forest
Metrics
- ARR
- ACC
- AERR
- ANVOL
- Sharpe Ratio
- Maximum Drawdown
- Calmar Ratio
- MDD
- ROUGE-1
- ROUGE-2
- ROUGE-L
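Several of the portfolio metrics listed above can be computed from a series of monthly returns. The sketch below uses common textbook definitions, which are assumptions and may not match the paper's exact formulas: ANVOL is taken to mean annualized volatility (sample standard deviation times the square root of 12), the Sharpe Ratio uses annualized mean return over annualized volatility, and the Calmar Ratio divides an annualized return by the maximum drawdown. The return series is toy data.

```python
import math
from typing import List


def annualized_volatility(monthly: List[float]) -> float:
    """Sample standard deviation of monthly returns, scaled by sqrt(12)."""
    if len(monthly) < 2:
        return 0.0
    mean = sum(monthly) / len(monthly)
    var = sum((r - mean) ** 2 for r in monthly) / (len(monthly) - 1)
    return math.sqrt(var) * math.sqrt(12.0)


def max_drawdown(monthly: List[float]) -> float:
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in monthly:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def sharpe_ratio(monthly: List[float], risk_free_annual: float = 0.0) -> float:
    """Annualized mean excess return divided by annualized volatility."""
    vol = annualized_volatility(monthly)
    if vol == 0.0:
        return 0.0  # undefined without variation; 0.0 is a placeholder choice
    mean_annual = 12.0 * sum(monthly) / len(monthly)
    return (mean_annual - risk_free_annual) / vol


def calmar_ratio(annual_return: float, mdd: float) -> float:
    """Annualized return divided by maximum drawdown."""
    return math.inf if mdd == 0.0 else annual_return / mdd


toy_monthly = [0.10, -0.20, 0.05, 0.03]  # hypothetical monthly returns
print("annualized volatility:", round(annualized_volatility(toy_monthly), 4))
print("max drawdown:", round(max_drawdown(toy_monthly), 4))
print("Sharpe ratio:", round(sharpe_ratio(toy_monthly), 4))
```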
Datasets
- AlphaFin
- AlphaFin-Test
- FPB
- FinQA
- ConvFinQA
- Headline
- StockQA
- Financial News
- Financial Reports
- DataYes
- Tushare
- AKShare
Benchmarks
- AlphaFin-Test

