SNFinLLM: Chinese financial LLM with domain pretraining, instruction tuning, DPO alignment, and calculator integration

August 5, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper shows solid benchmark and internal-test gains from domain pretraining and a calculator tool, but it releases no public code or data and DPO has mixed effects, so adopt it only after in-house validation.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 45%

Authors

Shujuan Zhao, Lingfeng Qiao, Kangyang Luo, Qian-Wen Zhang, Junru Lu, Di Yin

Why It Matters For Business

Domain pre-training plus instruction tuning yields measurable accuracy gains on finance QA and exam tasks. Adding a calculator reduces numeric errors, which is useful for advisory, research automation, and computation-heavy workflows.

Summary TLDR

SNFinLLM is a Chinese financial assistant built by (1) continuous domain pre-training on ~100B unsupervised tokens (25B finance), (2) full-parameter fine-tuning on 550k supervised instruction examples, and (3) a Direct Preference Optimization (DPO) step. The authors also have the model emit Python-executable calculator expressions to handle numeric tasks. On standard benchmarks (FinEval, FinanceIQ) and internal finance test sets, SNFinLLM variants beat an open-source baseline and several Chinese financial LLMs on many tasks. Calculator integration raises finance-computation accuracy; DPO improves some exam-style metrics but can hurt computation and MRC. Code and data availability are not stated.

Problem Statement

Generic and existing finance LLMs suffer hallucinations, weak calculation accuracy, and limited instruction-following in Chinese finance tasks. The paper aims to build a Chinese financial LLM that (a) learns domain knowledge, (b) follows finance-style instructions, and (c) performs reliable numeric computations.

Main Contribution

A three-stage training pipeline: continuous domain pre-training, full-parameter supervised fine-tuning (550k instructions), then DPO alignment.

Curated financial corpora: 25B finance tokens plus general data, totaling ~100B unsupervised tokens, with 7,689 new domain tokens added to the vocabulary.
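The DPO stage named in the pipeline above optimizes the standard DPO objective on pairs of preferred and rejected responses. The snippet below is a minimal sketch of that per-pair loss, not the authors' code: the function names, the `beta` value of 0.1, and the use of summed sequence log-probabilities as inputs are all assumptions, since this summary does not report the paper's hyperparameters.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) is the same as softplus(-x).
    return softplus(-logits)
```

When the policy and the reference model agree, the loss equals log 2. It falls below that as the policy shifts probability toward the preferred response relative to the reference.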

Key Findings

Domain continuous pre-training raises benchmark accuracy.

Numbers: FinEval +4.64 pp (59.30 → 63.94)

Practical Use: Do a dedicated domain pre-training pass with sizable finance text before instruction tuning to gain several percentage points of accuracy on finance QA.

Evidence Ref: Table 1, Benchmark Results

Instruction fine-tuning produces a stronger instruction-following assistant and improves exam-style tasks vs peers.

Numbers: qEQA, SNFinLLM-chat 61.36 vs Tongyi 51.50 (+9.86 pp)

Practical Use: Invest in full-parameter SFT with carefully curated finance instructions to boost practical QA and exam-like performance against off-the-shelf finance models.

Evidence Ref: Table 1, Self-evaluation Results

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | SNFinLLM-base 63.94% | Opensource-base 59.30% | +4.64 pp | FinEval | Table 1 Benchmark Results | Table 1 |
| Accuracy | SNFinLLM-base 54.32% | Opensource-base 50.32% | +4.00 pp | FinanceIQ | Table 1 Benchmark Results | Table 1 |
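The deltas in these results are absolute differences in percentage points, not relative improvements. The short sketch below recomputes them from the reported accuracies and also derives the relative gain, which comes out just under 8% on both benchmarks. The helper names are illustrative only.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(value - baseline, 2)


def relative_gain_pct(value: float, baseline: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 2)


# (SNFinLLM-base accuracy, Opensource-base accuracy) as reported above.
results = {
    "FinEval": (63.94, 59.30),
    "FinanceIQ": (54.32, 50.32),
}

for name, (value, baseline) in results.items():
    print(name, delta_pp(value, baseline), relative_gain_pct(value, baseline))
```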

What To Try In 7 Days

Collect a targeted finance corpus (news, reports, papers) and tokenize with SentencePiece, adding domain tokens.

Fine-tune a base LLM with a curated 10k–100k instruction-style dataset to test instruction-following gains.

Prototype a calculator integration (emit Python expressions and execute) for numeric QA to avoid arithmetic hallucinations.
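The calculator prototype in the last item could start from something like the sketch below. It assumes the model wraps arithmetic in a `<calc>...</calc>` tag. That tag format is hypothetical, as are the function names and the two-decimal output; the paper summary only says the model emits Python-executable expressions. The evaluator walks the parsed syntax tree and accepts only numeric literals and basic arithmetic, rather than calling `eval()` on model output.

```python
import ast
import operator
import re

# Whitelisted arithmetic operators; any other syntax is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

# Hypothetical tag format for model-emitted expressions.
CALC_TAG = re.compile(r"<calc>(.*?)</calc>")


def safe_eval(expression: str) -> float:
    """Evaluate a pure-arithmetic expression without using eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


def fill_calculations(model_output: str) -> str:
    """Replace each <calc>...</calc> span with its computed value (2 decimals)."""
    return CALC_TAG.sub(lambda m: f"{safe_eval(m.group(1)):.2f}", model_output)


# Example: compound interest on 10,000 at 5% for 3 years.
print(fill_calculations("Final value: <calc>10000 * (1 + 0.05) ** 3</calc>"))
```

A production version would also need to bound exponent sizes and handle division by zero, both of which this sketch leaves out.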

Agent Features

Tool Use
calculator integration (Python executable)

Optimization Features

Training Optimization
SFT; cosine LR decay with warmup
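A cosine learning-rate schedule with linear warmup can be written in a few lines. The sketch below shows one common formulation. The function name, the step counts, and the peak learning rate in the usage loop are assumptions for illustration, because the paper's actual values are not given in this summary.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Illustrative values only (not from the paper).
for s in (0, 99, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=1e-5))
```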

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

No public release of code or datasets reported in the paper.

Some key improvements are modest (single-digit percentage points) and task-dependent.

When Not To Use

If you require out-of-the-box, reproducible models (code/data not provided).

For tasks with complex multi-document reasoning where cMRC performance is critical.

Failure Modes

Arithmetic hallucinations if calculator integration is disabled or misused.

Reduced factual/computational accuracy after DPO if not validated on target tasks.

Core Entities

Models

SNFinLLM, SNFinLLM-base, SNFinLLM-chat, SNFinLLM-dpo, SNFinLLM-cal, opensource-base, opensource-refine, Tongyi-Finance-14B, XuanYuan-13B

Metrics

Accuracy

Datasets

FinEval, FinanceIQ, qEQA, FinC, KQA, MRC, cMRC

Benchmarks

FinEval, FinanceIQ