SNFinLLM: Chinese financial LLM with domain pretraining, instruction tuning, DPO alignment, and calculator integration

August 5, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

1

Authors

Shujuan Zhao, Lingfeng Qiao, Kangyang Luo, Qian-Wen Zhang, Junru Lu, Di Yin

Why It Matters For Business

Domain pre-training plus instruction tuning yields accuracy gains of several percentage points on finance QA and exam tasks, and adding a calculator tool reduces numeric errors (+1.55 pp on FinC). Both are relevant to advisory, research automation, and computation-heavy workflows.

Summary TLDR

SNFinLLM is a Chinese financial assistant built by (1) continuous domain pre-training on ~100B unsupervised tokens (25B of them finance), (2) full-parameter fine-tuning on 550k supervised instruction examples, and (3) a Direct Preference Optimization (DPO) step. The authors also train the model to emit Python-executable calculator expressions so that numeric tasks are computed rather than generated. On standard benchmarks (FinEval, FinanceIQ) and internal finance test sets, SNFinLLM variants beat an open-source baseline and several Chinese financial LLMs on many tasks. Calculator integration raises finance-computation accuracy; DPO improves some exam-style metrics but can hurt computation and MRC. Code and data availability are not stated.

Problem Statement

Generic and existing finance LLMs suffer from hallucinations, weak calculation accuracy, and limited instruction-following on Chinese finance tasks. The paper aims to build a Chinese financial LLM that (a) learns domain knowledge, (b) follows finance-style instructions, and (c) performs reliable numeric computations.

Main Contribution

A three-stage training pipeline: continuous domain pre-training, full-parameter supervised fine-tuning (550k instructions), then DPO alignment.

Curated financial corpora: 25B finance tokens plus general data, totaling ~100B unsupervised tokens, with 7,689 new domain tokens added to the vocabulary.

Computation instruction data and a Python-executable [Calculator(expression)->result] output format, so that arithmetic is executed by a tool rather than generated by the model.

Empirical evaluation on FinEval, FinanceIQ and five internal finance datasets showing consistent gains over an open-source baseline and other Chinese finance LLMs.

Ablation studies showing benefits of domain pre-training and calculator tool, and mixed effects from DPO alignment.
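The DPO stage in the pipeline above optimizes a preference loss over pairs of chosen and rejected responses. The sketch below shows the standard per-pair DPO objective in plain Python, not the authors' training code. The function names and the default `beta = 0.1` are illustrative assumptions; the summary does not report the paper's hyperparameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. beta (assumed value) controls how far the policy may drift
    from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy favors the chosen answer more than the reference does: loss drops.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because the loss only rewards the relative margin between chosen and rejected responses, it does not by itself protect task accuracy, which is consistent with the mixed effects reported for DPO.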

Key Findings

Domain continuous pre-training raises benchmark accuracy.

Numbers: FEval +4.64 pp (59.30 → 63.94)

Instruction fine-tuning produces a stronger instruction-following assistant and improves exam-style tasks vs peers.

Numbers: qEQA, SNFinLLM-chat 61.36 vs Tongyi-Finance-14B 51.50 (+9.86 pp)

Adding a calculator tool improves finance computation accuracy.

Numbers: FinC best 52.01% (SNFinLLM-cal), +1.55 pp vs the without-tool variant

DPO alignment helps some exam metrics but can reduce computation and MRC scores.

Numbers: qEQA 65.33 (DPO) vs 61.36 (chat), +3.97 pp; FinC 48.46 (DPO) vs 50.46 (chat), −2.00 pp

Pretraining on domain data matters; skipping it degrades performance.

Numbers: Opensource-refine trails SNFinLLM-base by ≥4 pp on key benchmarks
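The percentage-point deltas quoted in the findings above can be recomputed from the reported scores. The short script below does that arithmetic; the scores are copied from this summary, while the tuple layout and function names are illustrative choices.

```python
# (task, baseline label, baseline score, candidate label, candidate score)
# Scores are the ones quoted in the Key Findings; task labels are as written there.
COMPARISONS = [
    ("FEval", "Opensource-base", 59.30, "SNFinLLM-base", 63.94),
    ("qEQA", "Tongyi", 51.50, "SNFinLLM-chat", 61.36),
    ("FinC", "SNFinLLM-chat", 50.46, "SNFinLLM-cal", 52.01),
    ("qEQA", "SNFinLLM-chat", 61.36, "SNFinLLM-dpo", 65.33),
    ("FinC", "SNFinLLM-chat", 50.46, "SNFinLLM-dpo", 48.46),
]


def pp_delta(baseline: float, candidate: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(candidate - baseline, 2)


def summarize(rows):
    """Format one line per comparison with a signed percentage-point delta."""
    return [
        f"{task}: {cand_name} vs {base_name}: {pp_delta(base, cand):+.2f} pp"
        for task, base_name, base, cand_name, cand in rows
    ]


if __name__ == "__main__":
    for line in summarize(COMPARISONS):
        print(line)
```

The last row is the only negative delta, which is the DPO regression on finance computation noted above.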

Results

Accuracy

Value: SNFinLLM-base 63.94%

Baseline: Opensource-base 59.30%

Accuracy

Value: SNFinLLM-base 54.32%

Baseline: Opensource-base 50.32%

Finance computing (FinC)

Value: SNFinLLM-cal 52.01%

Baseline: SNFinLLM-chat 50.46%

Qualification Exam QA (qEQA)

Value: SNFinLLM-dpo 65.33%

Baseline: SNFinLLM-chat 61.36%

Who Should Care

What To Try In 7 Days

Collect a targeted finance corpus (news, reports, papers) and tokenize with SentencePiece, adding domain tokens.

Fine-tune a base LLM with a curated 10k–100k instruction-style dataset to test instruction-following gains.

Prototype a calculator integration (emit Python expressions and execute) for numeric QA to avoid arithmetic hallucinations.
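For the calculator prototype, one possible shape is sketched below: the model emits spans in the [Calculator(expression)->result] format, and a post-processing step evaluates each expression and fills in the result. This is a sketch under stated assumptions, not the paper's implementation. The function names, the rounding to four decimals, and the choice to whitelist arithmetic via Python's `ast` module instead of calling `eval` are all illustrative.

```python
import ast
import operator
import re

# Arithmetic operators the evaluator accepts; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

# Matches "[Calculator(<expr>)-><anything>]"; the text after "->" is overwritten.
_CALL = re.compile(r"\[Calculator\((.+?)\)->[^\]]*\]")


def safe_eval(expression: str):
    """Evaluate a pure arithmetic expression without using eval()."""
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return _eval(ast.parse(expression, mode="eval").body)


def fill_calculator_calls(text: str, ndigits: int = 4) -> str:
    """Replace the result part of every calculator span with the computed value."""
    def _replace(match):
        expr = match.group(1)
        value = round(safe_eval(expr), ndigits)
        return f"[Calculator({expr})->{value}]"

    return _CALL.sub(_replace, text)


if __name__ == "__main__":
    draft = "Two-year interest: [Calculator(10000 * (1 + 0.05) ** 2 - 10000)->?]"
    print(fill_calculator_calls(draft))
```

Restricting evaluation to numeric literals and arithmetic operators keeps model output from running arbitrary code; a production version would also want limits on exponent size and handling for division by zero.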

Agent Features

Tool Use

  • calculator integration (Python executable)

Optimization Features

Training Optimization

  • SFT
  • cosine LR decay with warmup
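A cosine learning-rate schedule with warmup, as listed above, can be written as a small function of the step index. The version below uses linear warmup followed by cosine decay; the parameter names and the example values (peak rate, step counts) are assumptions for illustration, since the summary does not report the paper's settings.

```python
import math


def lr_at_step(step: int,
               total_steps: int,
               warmup_steps: int,
               peak_lr: float,
               min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical settings, not taken from the paper.
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=1e-5))
```

Warmup avoids large updates while optimizer statistics are still noisy, and the cosine tail lowers the rate smoothly toward the end of training.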

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No public release of code or datasets reported in the paper.
  • Some key improvements are modest (single-digit percentage points) and task-dependent.
  • DPO alignment can hurt computation and MRC performance; tuning required per use case.
  • Complex MRC (cMRC) still lags and needs further research.

When Not To Use

  • If you require out-of-the-box, reproducible models (code/data not provided).
  • For tasks with complex multi-document reasoning where cMRC performance is critical.
  • When you cannot afford full-parameter fine-tuning or large-domain pretraining costs.

Failure Modes

  • Arithmetic hallucinations if calculator integration is disabled or misused.
  • Reduced factual/computational accuracy after DPO if not validated on target tasks.
  • Overfitting to domain patterns if domain/general data ratio is poorly chosen.

Core Entities

Models

  • SNFinLLM
  • SNFinLLM-base
  • SNFinLLM-chat
  • SNFinLLM-dpo
  • SNFinLLM-cal
  • opensource-base
  • opensource-refine
  • Tongyi-Finance-14B
  • XuanYuan-13B

Metrics

  • Accuracy

Datasets

  • FinEval
  • FinanceIQ
  • qEQA
  • FinC
  • KQA
  • MRC
  • cMRC

Benchmarks

  • FinEval
  • FinanceIQ