Build a modular Chinese financial LLM with instruction data and four task-specific LoRA experts

October 23, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

4

Authors

Wei Chen, Qiushi Wang, Zefei Long, Xianyin Zhang, Zhongtian Lu, Bingxuan Li, Siyuan Wang, Jiarong Xu, Xiang Bai, Xuanjing Huang, Zhongyu Wei

Why It Matters For Business

You can get domain gains cheaply by training small LoRA adapters and adding tool plugins instead of retraining large models; this yields better finance answers, more reliable calculations, and modular deployment.

Summary TLDR

The authors build DISC-FinLLM, a Chinese financial LLM that combines a 246k-example financial instruction dataset (DISC-FIN-SFT) with a Multiple Experts Fine-tuning Framework (MEFF). They train four task-specific LoRA adapters (consulting, NLP tasks, computing, retrieval) on Baichuan-13B and add simple tool plugins (calculator, equation solver, counter, probability table) and a retrieval plugin. LoRA experts give consistent gains across benchmarks (2–9 points avg.), improve calculation accuracy and retrieval-based answers, and let you swap capabilities without re-training the full model. Code is available; some training data and a proprietary knowledge base are not fully public.

Problem Statement

General LLMs lack specialized Chinese financial knowledge, robust numeric computation, multi-turn finance dialogs, and reliable retrieval in finance. Training huge closed models is costly; a compact, modular method is needed to adapt a base LLM to multiple financial tasks efficiently.

Main Contribution

DISC-FIN-SFT: a 246k-example Chinese financial instruction-tuning dataset covering consulting, NLP tasks, computing, and retrieval-enhanced instructions.

Multiple Experts Fine-tuning Framework (MEFF): train four separate LoRA adapters for distinct finance skills and load them modularly at runtime.
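The modular loading idea can be illustrated with a toy layer: a frozen base weight W shared by all experts, plus a per-expert low-rank update so that the effective weight is W + (alpha / r) · B · A. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `ModularLinear`, its method names, and the tiny 2x2 matrices are all hypothetical, and a real system would use a LoRA library on top of the Baichuan-13B weights.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (keeps the sketch dependency-free)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


class ModularLinear:
    """Hypothetical layer: one frozen base weight plus named low-rank updates."""

    def __init__(self, base: Matrix) -> None:
        self.base = base  # shared by all experts, never modified
        self.adapters: Dict[str, Tuple[Matrix, Matrix, float]] = {}
        self.active: Optional[str] = None

    def add_adapter(self, name: str, B: Matrix, A: Matrix, alpha: float) -> None:
        rank = len(A)  # A is (rank x d_in), B is (d_out x rank)
        self.adapters[name] = (B, A, alpha / rank)

    def set_active(self, name: Optional[str]) -> None:
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x: List[float]) -> List[float]:
        weight = [row[:] for row in self.base]  # copy so the base stays frozen
        if self.active is not None:
            B, A, scale = self.adapters[self.active]
            delta = matmul(B, A)
            weight = [
                [w + scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(weight, delta)
            ]
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy values chosen only to make the arithmetic easy to follow.
layer = ModularLinear(base=[[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("computing", B=[[1.0], [0.0]], A=[[0.0, 2.0]], alpha=1.0)
layer.add_adapter("retrieval", B=[[0.0], [1.0]], A=[[1.0, 0.0]], alpha=2.0)

layer.set_active("computing")
computing_out = layer.forward([1.0, 1.0])
layer.set_active("retrieval")
retrieval_out = layer.forward([1.0, 1.0])
layer.set_active(None)
base_out = layer.forward([1.0, 1.0])
```

Switching experts only changes which small (B, A) pair is applied; the base weight is never copied per expert or retrained, which is what makes swapping capabilities cheap.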

Calculation and retrieval plugins: four small computation tools and a retrieval pipeline integrated via instruction data and tool-call tokens.
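A tool-call token mechanism of this kind might look like the following sketch: the model emits a marker in its text, a post-processor finds the marker, runs the tool, and splices the result back in. The `[Calculator(...)]` syntax, the function names, and the two-decimal output format are assumptions made for illustration; the summary does not specify the exact token format. The evaluator walks the parsed expression tree rather than calling `eval`, so only plain arithmetic is accepted.

```python
import ast
import operator
import re

# Whitelisted arithmetic operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expr: str) -> float:
    """Evaluate a pure arithmetic expression without using eval()."""

    def ev(node: ast.AST) -> float:
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")

    return ev(ast.parse(expr, mode="eval").body)


# Assumed marker syntax: [Calculator(<expression>)]
_TOOL_CALL = re.compile(r"\[Calculator\((.*?)\)\]")


def run_tool_calls(text: str) -> str:
    """Replace every calculator marker with its computed value."""
    return _TOOL_CALL.sub(lambda m: f"{safe_eval(m.group(1)):.2f}", text)


draft = "After two years the balance is [Calculator(100*(1+0.05)**2)] yuan."
final = run_tool_calls(draft)
```

Keeping the arithmetic outside the model means the numeric result no longer depends on the model's own token-by-token calculation, only on it writing the right formula.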

Comprehensive evaluation: experiments on FinCUGE/FinEval-like benchmarks, a manually built calculation set, and a current-affairs retrieval test show measurable gains over base models.

Key Findings

Task-specific LoRA adapters raise average FinNLP performance by roughly 2 to 9 points versus the base model.

Numbers: Average improvement of +2 to +9 points on six FinCUGE tasks (Table 3)

Computation LoRA plus calculator plugin substantially improves formula creation and numeric answers.

Numbers: Formula & result accuracy 0.35 vs Baichuan-13B-Chat 0.12 and GPT-3.5 0.26 (Table 5)

Retrieval-enhanced adapter improves human-judged utility and linguistic quality slightly over base chat model.

Numbers: Retrieval metrics (accuracy/usefulness/linguistic quality/reflectiveness): 4.13/4.29/4.33/3.95 vs base 4.08/4.15/4.21/3.88 (GPT-

Results

FinCUGE (avg over 6 tasks)

Value: Baichuan-13B-Chat: 31 -> LoRA: 40 (example)

Baseline: Baichuan-13B-Chat (no task-specific fine-tuning)

FinEval average

Value: DISC-FinLLM variants: ~50.6–51.6; GPT-4: 68.6; ChatGPT: 55.0

Baseline: Baichuan-13B-Chat: 49.4

Calculation accuracy (formula and result)

Value: DISC-FinLLM (Computing): 0.35

Baseline: Baichuan-13B-Chat: 0.12; GPT-3.5: 0.26

Retrieval human-judged scores

Value: DISC-FinLLM (Retrieval): accuracy=4.13, usefulness=4.29, linguistic quality=4.33, reflectiveness=3.95

Baseline: Baichuan-13B-Chat: 4.08 / 4.15 / 4.21 / 3.88
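To put the reported numbers side by side, the gains can be computed directly from the values listed in this section. The snippet below is only arithmetic on those figures; the variable names are ours, and no new results are introduced.

```python
# Figures copied from the Results section above.
fineval_baseline = 49.4
fineval_variants = (50.6, 51.6)
fineval_gains = [round(v - fineval_baseline, 1) for v in fineval_variants]

computing = 0.35
ratio_vs_base = round(computing / 0.12, 2)    # vs Baichuan-13B-Chat
ratio_vs_gpt35 = round(computing / 0.26, 2)   # vs GPT-3.5

metrics = ("accuracy", "usefulness", "linguistic quality", "reflectiveness")
retrieval = (4.13, 4.29, 4.33, 3.95)
base_chat = (4.08, 4.15, 4.21, 3.88)
retrieval_gains = {
    name: round(new - old, 2)
    for name, new, old in zip(metrics, retrieval, base_chat)
}

print("FinEval gain over base:", fineval_gains)
print("Computing accuracy ratio:", ratio_vs_base, ratio_vs_gpt35)
print("Retrieval score gains:", retrieval_gains)
```

Read this way, the FinEval gain is about 1 to 2 points, the calculation accuracy is close to three times the base model's, and the retrieval scores move by 0.05 to 0.14 on the rating scale, which matches the "modest gains" caveat under Limitations.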

Who Should Care

What To Try In 7 Days

Fork the repo and run the Baichuan-13B base with a single LoRA adapter on your finance prompts.

Build a small calculator plugin and add tool-call tokens for arithmetic-heavy queries.

Create a 1–5k instruction seed (consulting + retrieval) from your internal docs and fine-tune a LoRA for retrieval.
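For the instruction-seed step, one possible record layout is sketched below. The field names (`task`, `instruction`, `output`), the prompt template, and the sample text are all hypothetical placeholders, not the DISC-FIN-SFT schema; adapt them to whatever your fine-tuning script expects.

```python
import json
from typing import Dict, List, Optional

ALLOWED_TASKS = {"consulting", "retrieval"}


def make_record(
    task: str, question: str, answer: str, reference: Optional[str] = None
) -> Dict[str, str]:
    """Build one instruction-tuning example; retrieval examples embed the source text."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unknown task: {task}")
    if not question.strip() or not answer.strip():
        raise ValueError("question and answer must be non-empty")
    if task == "retrieval":
        if not reference:
            raise ValueError("retrieval examples need a reference passage")
        prompt = f"Reference:\n{reference}\n\nQuestion: {question}"
    else:
        prompt = question
    return {"task": task, "instruction": prompt, "output": answer}


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records one JSON object per line, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Made-up placeholder content, for illustration only.
seed = [
    make_record(
        "retrieval",
        question="What is the minimum holding period for product X?",
        answer="According to the reference, product X must be held for at least 30 days.",
        reference="Product X terms: minimum holding period 30 days.",
    ),
    make_record(
        "consulting",
        question="What does 'duration' mean for a bond fund?",
        answer="Duration measures how sensitive the fund's price is to interest-rate changes.",
    ),
]
jsonl_text = to_jsonl(seed)
```

Putting the reference passage inside the instruction teaches the adapter to answer from supplied text, which is the behavior the retrieval expert needs at inference time.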

Agent Features

Tool Use

  • Expression calculator
  • Equation solver
  • Counter
  • Probability table
  • Retrieval plugin

Frameworks

  • LoRA
  • Toolformer-style invocation
  • Chain-of-Thought
  • Chain-of-Retrieval

Architectures

  • LoRA
  • plugin-enabled model (tool calls)

Optimization Features

Infra Optimization

  • LoRA

Model Optimization

  • LoRA

Training Optimization

  • Task-specific adapter training to avoid full fine-tuning
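The saving from adapter training can be estimated with a quick parameter count: a LoRA update for a d_out x d_in weight matrix stores two factors, B (d_out x r) and A (r x d_in), so it trains r · (d_in + d_out) values instead of d_in · d_out. The dimensions and rank below are illustrative assumptions; the summary does not state the rank or which matrices were adapted.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values in a rank-r update B (d_out x r) times A (r x d_in)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Trainable values when the whole weight matrix is fine-tuned."""
    return d_in * d_out


# Assumed sizes: one 5120 x 5120 projection matrix and rank 8.
d_model, rank = 5120, 8
adapter = lora_params(d_model, d_model, rank)
full = full_params(d_model, d_model)
fraction = adapter / full

print(f"LoRA: {adapter:,} vs full: {full:,} ({fraction:.4%} of the matrix)")
```

Under these assumptions each adapted matrix trains about 0.3% of the values that full fine-tuning would, which is why four separate experts remain cheap to train and store.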

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training data partly generated by ChatGPT, which can inject hallucinated or stylized answers.
  • Proprietary retrieval knowledge base is not fully public, limiting reproducibility.
  • Evaluations show modest gains; DISC-FinLLM still trails GPT-4 on benchmarks.

When Not To Use

  • High-stakes automated trading or compliance decisions requiring certified correctness.
  • Tasks needing live, real-time market feeds not covered by the static KB.
  • When you must match or exceed GPT-4-level performance.

Failure Modes

  • Hallucinated financial facts from ChatGPT-generated training content.
  • Wrong numeric answers if the model fails to call the compute plugin.
  • Relevant documents missed by retrieval and not surfaced in answers.
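The missed-tool-call failure mode can be monitored with a lightweight check on each query and response pair. The sketch below is a heuristic under stated assumptions: the `[Calculator(` marker is a hypothetical tool-call syntax, the function names are ours, and the arithmetic pattern is deliberately crude (it would also fire on dates such as 2023-10-23).

```python
import re

# Crude signal that a query contains explicit arithmetic, e.g. "1200 * 1.05".
_ARITHMETIC = re.compile(r"\d\s*[+\-*/×÷^]\s*\d")
# Assumed tool-call marker emitted by the computing expert.
_TOOL_MARKER = "[Calculator("


def needs_calculator(query: str) -> bool:
    """True if the query appears to ask for a numeric computation."""
    return bool(_ARITHMETIC.search(query))


def missed_tool_call(query: str, raw_output: str) -> bool:
    """Flag responses that answer an arithmetic query without invoking the tool."""
    return needs_calculator(query) and _TOOL_MARKER not in raw_output


flagged = missed_tool_call("What is 1200 * 1.05?", "The result is 1250.")
ok_with_tool = missed_tool_call("What is 1200 * 1.05?", "It is [Calculator(1200*1.05)].")
ok_no_math = missed_tool_call("Explain what a bond is.", "A bond is a debt security.")
```

Flagged responses can be routed to a retry with the computing adapter or to human review rather than returned as-is.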

Core Entities

Models

  • Baichuan-13B
  • ChatGLM
  • ChatGLM2
  • GPT-3.5
  • GPT-4
  • FinGPT-v3
  • BloombergGPT
  • LLaMA-2-Chat-70B

Metrics

  • Accuracy
  • F1
  • ROUGE
  • usefulness
  • linguistic quality
  • reflectiveness

Datasets

  • DISC-FIN-SFT
  • FiQA
  • FPB
  • FNSC
  • Wealth-alpaca
  • SmoothNLP
  • FinCUGE
  • FinEval
  • FinFE
  • FinQA
  • FinCQA
  • FinNA
  • FinRE
  • FinESE

Benchmarks

  • FinCUGE (subset used)
  • FinEval
  • BBT-FIN (as shown in paper tables)