Build a modular Chinese financial LLM from instruction data and four task-specific LoRA experts

Overview

Decision Snapshot: Needs Validation

The approach is practical: modular LoRA adapters and small tool plugins deliver domain improvements without heavy compute. However, the gains are demonstrated only on benchmarks, and the system still trails high-end closed models such as GPT-4.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Wei Chen, Qiushi Wang, Zefei Long, Xianyin Zhang, Zhongtian Lu, Bingxuan Li, Siyuan Wang, Jiarong Xu, Xiang Bai, Xuanjing Huang, Zhongyu Wei

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get domain gains cheaply by training small LoRA adapters and plugins instead of re-training big models; this yields better finance answers, more reliable calculations, and modular deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors build DISC-FinLLM, a Chinese financial LLM that combines a 246k-example financial instruction dataset (DISC-FIN-SFT) with a Multiple Experts Fine-tuning Framework (MEFF). They train four task-specific LoRA adapters (consulting, NLP tasks, computing, retrieval) on Baichuan-13B and add simple tool plugins (calculator, equation solver, counter, probability table) and a retrieval plugin. LoRA experts give consistent gains across benchmarks (2–9 points avg.), improve calculation accuracy and retrieval-based answers, and let you swap capabilities without re-training the full model. Code is available; some training data and a proprietary knowledge base are not fully public.

Problem Statement

General LLMs lack specialized Chinese financial knowledge, robust numeric computation, multi-turn finance dialogs, and reliable retrieval in finance. Training huge closed models is costly; a compact, modular method is needed to adapt a base LLM to multiple financial tasks efficiently.

Main Contribution

DISC-FIN-SFT: a 246k-example Chinese financial instruction-tuning dataset covering consulting, NLP tasks, computing, and retrieval-enhanced instructions.

Multiple Experts Fine-tuning Framework (MEFF): train four separate LoRA adapters for distinct finance skills and load them modularly at runtime.
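The modular loading idea can be illustrated with a minimal pure-Python sketch. It assumes the standard LoRA formulation, where each expert stores a low-rank update (alpha / r) * B @ A that is added to a shared frozen weight when that expert is active. The class names, expert names, and toy matrices are hypothetical and are not taken from the DISC-FinLLM repository.

```python
# Sketch only: swapping LoRA experts over one frozen weight matrix.
# LoRAExpert and ModularLayer are hypothetical names for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

class LoRAExpert:
    """One low-rank update: delta_W = (alpha / r) * B @ A."""
    def __init__(self, A, B, alpha):
        self.A, self.B, self.alpha = A, B, alpha
        self.r = len(A)  # rank = number of rows of A

    def delta(self):
        return scale(matmul(self.B, self.A), self.alpha / self.r)

class ModularLayer:
    """A frozen base weight plus a registry of named, switchable experts."""
    def __init__(self, W):
        self.W = W          # frozen base weight, never modified
        self.experts = {}
        self.active = None  # None means "use the base model only"

    def register(self, name, expert):
        self.experts[name] = expert

    def activate(self, name):
        if name is not None and name not in self.experts:
            raise KeyError(f"unknown expert: {name}")
        self.active = name

    def forward(self, x):
        """Apply the (optionally adapted) weight to a column vector x."""
        W = self.W
        if self.active is not None:
            W = add(W, self.experts[self.active].delta())
        return matmul(W, x)

layer = ModularLayer([[1, 0], [0, 1]])
layer.register("computing", LoRAExpert(A=[[1, 0]], B=[[0], [1]], alpha=2))
layer.activate("computing")
adapted = layer.forward([[3], [4]])
layer.activate(None)
base = layer.forward([[3], [4]])
```

Because the base weight is never overwritten, switching from one expert to another only changes which small update is added at inference time, which is what makes per-task adapters cheap to store and swap.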

Key Findings

Task-specific LoRA adapters raise average FinNLP performance by 2 to 9 points versus the base model.

Numbers: Average improvement of +2 to +9 points on six FinCUGE tasks (Table 3)

Practical Use: Fine-tune small LoRA modules per task to get meaningful domain gains without full-model training; expect single-digit average score lifts on finance NLP benchmarks.

Evidence Ref: Table 3

Computation LoRA plus calculator plugin substantially improves formula creation and numeric answers.

Numbers: Formula and result accuracy of 0.35, versus 0.12 for Baichuan-13B-Chat and 0.26 for GPT-3.5 (Table 5)

Practical Use: Add a tool-based compute adapter when you need reliable numeric outputs; this reduces arithmetic errors compared with a vanilla chat model.

Evidence Ref: Table 5
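A calculator plugin of this kind can be sketched as a post-processing step that finds tool-call markers in the generated text and replaces them with computed values. The `[Calculator(...)]` marker syntax and the function names below are assumptions for illustration and may differ from the tokens the authors use. The evaluator walks the parsed expression with Python's `ast` module rather than calling `eval`, so only plain arithmetic is accepted.

```python
import ast
import operator
import re

# Arithmetic operators the sketch accepts; anything else is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

# Hypothetical marker format: [Calculator(<expression>)]
_TOOL_CALL = re.compile(r"\[Calculator\((.+?)\)\]")

def fill_tool_calls(text):
    """Replace each calculator marker with its value, rounded to 2 decimals."""
    def repl(match):
        return f"{safe_eval(match.group(1)):.2f}"
    return _TOOL_CALL.sub(repl, text)

draft = "After two years the balance is [Calculator(1000*(1+0.05)**2)] yuan."
final = fill_tool_calls(draft)
```

The design choice here is that the model only has to write the formula correctly; the arithmetic itself is delegated to deterministic code.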

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FinCUGE (avg over 6 tasks)	Baichuan-13B-Chat: 31 -> LoRA: 40 (example)	Baichuan-13B-Chat (without LoRA)	+9	Table 3 (BBT-FIN/FinCUGE subset)	LoRA training improved average from 31 to 40 for Baichuan-13B-Chat	Table 3
FinEval average	DISC-FinLLM variants: ~50.6–51.6; GPT-4: 68.6; ChatGPT: 55.0	Baichuan-13B-Chat: 49.4	DISC variants +1.2 to +8.0 vs base	FIN-Eval (Table 4)	DISC-FinLLM consulting/task/retrieval/computing variants score ~50–51.6 vs base 49.4	Table 4

What To Try In 7 Days

Fork the repo and run the Baichuan-13B base with a single LoRA adapter on your finance prompts.

Build a small calculator plugin and add tool-call tokens for arithmetic-heavy queries.

Create a 1–5k instruction seed (consulting + retrieval) from your internal docs and fine-tune a LoRA for retrieval.
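The instruction-seed step above can be prototyped with a few lines of standard-library Python. The sketch below is an assumption-laden illustration, not the authors' pipeline: it ranks internal documents by character-bigram overlap (a crude stand-in for a real retriever that works for both Chinese and English text), prepends the top hits to the question, and emits Alpaca-style records. The `instruction` and `output` field names and all function names are hypothetical.

```python
import json

def bigrams(text):
    """Character bigrams with whitespace removed; language-agnostic."""
    text = "".join(text.lower().split())
    return {text[i:i + 2] for i in range(len(text) - 1)}

def retrieve(query, docs, k=2):
    """Return up to k docs ranked by Jaccard overlap of character bigrams."""
    q = bigrams(query)
    scored = []
    for doc in docs:
        d = bigrams(doc)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_record(question, docs, answer, k=2):
    """Build one retrieval-enhanced training record (hypothetical schema)."""
    context = retrieve(question, docs, k)
    refs = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(context))
    return {
        "instruction": f"References:\n{refs}\n\nQuestion: {question}",
        "output": answer,
    }

def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

docs = [
    "The 2023 dividend payout ratio was 35 percent.",
    "The office cafeteria opens at 9 am.",
    "Quarterly revenue grew on strong loan demand.",
]
question = "What was the dividend payout ratio in 2023?"
record = build_record(question, docs, "35 percent", k=1)
```

In practice the answers would still need human review before training, since the record format only guarantees that the reference text is present in the prompt, not that the answer is grounded in it.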

Agent Features

Tool Use

Expression calculator, Equation solver, Counter, Probability table, Retrieval plugin

Frameworks

LoRA, Toolformer-style invocation, Chain-of-Thought, Chain-of-Retrieval

Architectures

LoRA, plugin-enabled model (tool calls)

Optimization Features

Infra Optimization

LoRA

Model Optimization

LoRA

Training Optimization

Task-specific adapter training to avoid full fine-tuning
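To see why adapter training is cheaper than full fine-tuning, it helps to count trainable parameters for a single weight matrix. The sketch below uses the standard LoRA parameter count, rank * (d_in + d_out), against d_in * d_out for the full matrix. The 5120 x 5120 projection size and rank 8 are illustrative assumptions, not values reported in this summary.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def trainable_fraction(d_in, d_out, rank):
    """Share of the full matrix's parameters that LoRA actually trains."""
    return lora_params(d_in, d_out, rank) / full_params(d_in, d_out)

if __name__ == "__main__":
    d, r = 5120, 8  # assumed sizes, for illustration only
    print(f"full: {full_params(d, d):,}")
    print(f"lora: {lora_params(d, d, r):,}")
    print(f"fraction: {trainable_fraction(d, d, r):.4%}")
```

Under these assumed sizes, each expert trains well under one percent of the parameters in the matrices it adapts, which is why four separate experts remain inexpensive to train and store.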

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FudanDISC/DISC-FinLLM

Data URLs

https://github.com/FudanDISC/DISC-FinLLM

Risks & Boundaries

Limitations

Training data partly generated by ChatGPT, which can inject hallucinated or stylized answers.

Proprietary retrieval knowledge base is not fully public, limiting reproducibility.

When Not To Use

High-stakes automated trading or compliance decisions requiring certified correctness.

Tasks needing live, real-time market feeds not covered by the static KB.

Failure Modes

Hallucinated financial facts from ChatGPT-generated training content.

Wrong numeric answers if the model fails to call the compute plugin.
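One way to catch the second failure mode is a lightweight post-check that recomputes any simple arithmetic statement the model wrote inline. This is a hypothetical safeguard, not something described in the paper: it only recognizes claims of the form `a op b = c` with a single operator, and the function name and tolerance are assumptions.

```python
import operator
import re

_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}
_NUM = r"-?\d+(?:\.\d+)?"
# Matches claims such as "1000 + 50 = 1060".
_CLAIM = re.compile(rf"({_NUM})\s*([+\-*/])\s*({_NUM})\s*=\s*({_NUM})")

def find_bad_arithmetic(answer, tol=0.01):
    """Return the inline 'a op b = c' claims whose stated result is wrong."""
    bad = []
    for match in _CLAIM.finditer(answer):
        left, op, right, claimed = match.groups()
        try:
            actual = _OPS[op](float(left), float(right))
        except ZeroDivisionError:
            # A division by zero can never equal the claimed number.
            bad.append(match.group(0))
            continue
        if abs(actual - float(claimed)) > tol:
            bad.append(match.group(0))
    return bad

answer = "Interest is 1000 * 0.05 = 50, so the total is 1000 + 50 = 1060."
flagged = find_bad_arithmetic(answer)
```

A flagged answer could then be routed back through the compute expert or surfaced to a reviewer; the check does not cover multi-step formulas or numbers stated without an explicit equation.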

Core Entities

Models

Baichuan-13B, ChatGLM, ChatGLM2, GPT-3.5, GPT-4, FinGPT-v3, BloombergGPT, LLaMA-2-Chat-70B

Metrics

Accuracy, F1, ROUGE, usefulness, linguistic quality, reflectiveness

Datasets

SFT, FiQA, FPB, FNSC, Wealth-alpaca, SmoothNLP, FinCUGE, FinEval, FinFE, FinQA, FinCQA, FinNA, FinRE, FinESE

Benchmarks

FinCUGE (subset used), FinEval, BBT-FIN (as shown in paper tables)
