SNFinLLM: Chinese financial LLM with domain pretraining, instruction tuning, DPO alignment, and calculator integration

Overview

Decision Snapshot: Needs Validation

The paper shows solid benchmark and internal-test gains from domain pre-training and a calculator tool, but it releases no public code or data and reports mixed effects from DPO, so adopt the approach only after in-house validation.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 45%

Authors

Shujuan Zhao, Lingfeng Qiao, Kangyang Luo, Qian-Wen Zhang, Junru Lu, Di Yin

Links

Abstract / PDF

Why It Matters For Business

Domain pre-training plus instruction tuning yields measurable accuracy gains on finance QA and exam tasks, and adding a calculator reduces numeric errors. Both are useful for advisory, research automation, and computation-heavy workflows.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

SNFinLLM is a Chinese financial assistant built in three stages: (1) continuous domain pre-training on ~100B unsupervised tokens (25B of them finance), (2) full-parameter fine-tuning on 550k supervised instruction examples, and (3) a Direct Preference Optimization (DPO) step. For numeric tasks, the model emits Python-executable calculator expressions that are then executed. On standard benchmarks (FinEval, FinanceIQ) and internal finance test sets, SNFinLLM variants beat an open-source baseline and several Chinese financial LLMs on many tasks. Calculator integration raises finance-computation accuracy; DPO improves some exam-style metrics but can hurt computation and MRC. Code and data availability are not stated.

Problem Statement

Generic and existing finance LLMs suffer from hallucinations, weak calculation accuracy, and limited instruction-following on Chinese finance tasks. The paper aims to build a Chinese financial LLM that (a) learns domain knowledge, (b) follows finance-style instructions, and (c) performs reliable numeric computations.

Main Contribution

A three-stage training pipeline: continuous domain pre-training, full-parameter supervised fine-tuning (550k instructions), then DPO alignment.

Curated financial corpora: 25B finance tokens plus general data, totaling ~100B unsupervised tokens, with 7,689 new domain tokens added to the tokenizer vocabulary.
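The DPO alignment stage named in the pipeline above can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch of the generally published loss, not the authors' training code: the function name `dpo_loss` and the `beta` value are assumptions for illustration, and the summary does not report the hyperparameters actually used.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response under
    the policy being trained or under the frozen reference model.
    beta is a hypothetical value, not taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log(2) when the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case.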

Key Findings

Domain continuous pre-training raises benchmark accuracy.

Numbers: FinEval +4.64 pp (59.30 → 63.94)

Practical Use: Do a dedicated domain pre-training pass with sizable finance text before instruction tuning to gain several percentage points of accuracy on finance QA.

Evidence Ref: Table 1 Benchmark Results

Instruction fine-tuning produces a stronger instruction-following assistant and improves exam-style tasks vs peers.

Numbers: qEQA, SNFinLLM-chat 61.36 vs Tongyi 51.50 (+9.86 pp)

Practical Use: Invest in full-parameter SFT with carefully curated finance instructions to boost practical QA and exam-like performance against off-the-shelf finance models.

Evidence Ref: Table 1 Self-evaluation Results

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	SNFinLLM-base 63.94%	Opensource-base 59.30%	+4.64 pp	FinEval	Table 1 Benchmark Results	Table 1
Accuracy	SNFinLLM-base 54.32%	Opensource-base 50.32%	+4.00 pp	FinanceIQ	Table 1 Benchmark Results	Table 1
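The deltas in the table are absolute differences in percentage points, not relative improvements. The short check below recomputes them from the reported values; the helper names `delta_pp` and `relative_gain` are illustrative, and the relative figure is shown only to make the distinction concrete.

```python
# Rows copied from the table above: (dataset, SNFinLLM-base %, Opensource-base %)
rows = [
    ("FinEval", 63.94, 59.30),
    ("FinanceIQ", 54.32, 50.32),
]

def delta_pp(value, baseline):
    """Absolute difference in percentage points, rounded to 2 decimals."""
    return round(value - baseline, 2)

def relative_gain(value, baseline):
    """Improvement relative to the baseline, in percent."""
    return round(100 * (value - baseline) / baseline, 2)

deltas = {name: delta_pp(value, base) for name, value, base in rows}
for name, value, base in rows:
    print(f"{name}: {deltas[name]:+.2f} pp, {relative_gain(value, base):+.2f}% relative")
```

A gain of about 4 pp on a baseline near 50 to 60% corresponds to a relative improvement of roughly 8%, which is worth keeping in mind when comparing against results reported in relative terms.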

What To Try In 7 Days

Collect a targeted finance corpus (news, reports, papers) and tokenize with SentencePiece, adding domain tokens.

Fine-tune a base LLM with a curated 10k–100k instruction-style dataset to test instruction-following gains.

Prototype a calculator integration (emit Python expressions and execute) for numeric QA to avoid arithmetic hallucinations.
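For the calculator prototype in the last step, one possible shape is sketched below. It assumes the model wraps arithmetic in a `<calc>…</calc>` tag, which is a hypothetical format chosen for illustration; the summary does not describe the paper's actual output format. The expression is parsed with Python's `ast` module and only plain arithmetic is allowed, rather than passing model output to `eval()`.

```python
import ast
import operator
import re

# Whitelisted arithmetic operators; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))

# Hypothetical tag format, not taken from the paper.
_CALC_TAG = re.compile(r"<calc>(.*?)</calc>")

def fill_calculations(model_output, ndigits=2):
    """Replace each <calc>...</calc> span with its computed, rounded value."""
    def repl(match):
        return str(round(safe_eval(match.group(1)), ndigits))
    return _CALC_TAG.sub(repl, model_output)
```

Function calls, names, and attribute access all raise `ValueError`, so a malformed or adversarial expression fails loudly instead of executing. A production version would likely also cap exponent sizes and handle division by zero.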

Agent Features

Tool Use

calculator integration (Python executable)

Optimization Features

Training Optimization

SFT: cosine LR decay with warmup
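A cosine learning-rate schedule with warmup can be sketched as follows. This is a common formulation (linear warmup, then cosine decay), not necessarily the exact variant the authors used; the function name `lr_at_step` and all parameter values are assumptions, since the summary does not give the peak rate, warmup length, or floor.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With illustrative values such as `total_steps=1000`, `warmup_steps=100`, and `peak_lr=1e-5`, the rate climbs for the first 100 steps, sits at half the peak midway through the decay phase, and reaches `min_lr` at the final step.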

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

No public release of code or datasets reported in the paper.

Some key improvements are modest (single-digit percentage points) and task-dependent.

When Not To Use

If you require out-of-the-box, reproducible models (code/data not provided).

For tasks with complex multi-document reasoning where cMRC performance is critical.

Failure Modes

Arithmetic hallucinations if calculator integration is disabled or misused.

Reduced factual/computational accuracy after DPO if not validated on target tasks.
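Because DPO can lower accuracy on some tasks, a simple per-task regression gate can be run before promoting an aligned checkpoint. The sketch below is one way to do that; the function name, the tolerance, and the task names and accuracy values in the usage note are placeholders for illustration, not results from the paper.

```python
def regression_report(before, after, tolerance_pp=0.5):
    """Compare per-task accuracy (in %) before and after an alignment step.

    Returns a dict of tasks whose accuracy dropped by more than
    tolerance_pp percentage points, mapped to the size of the drop.
    """
    regressions = {}
    for task, base_acc in before.items():
        drop = base_acc - after[task]
        if drop > tolerance_pp:
            regressions[task] = round(drop, 2)
    return regressions
```

For example, with placeholder scores `before = {"computation": 60.0, "reading": 70.0, "exam": 55.0}` and `after = {"computation": 57.5, "reading": 69.8, "exam": 58.0}`, only the computation task is flagged, which would be a signal to keep the pre-DPO checkpoint for computation-heavy workloads.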

Core Entities

Models

SNFinLLM, SNFinLLM-base, SNFinLLM-chat, SNFinLLM-dpo, SNFinLLM-cal, opensource-base, opensource-refine, Tongyi-Finance-14B, XuanYuan-13B

Metrics

Accuracy

Datasets

FinEval, FinanceIQ, qEQA, FinC, KQA, MRC, cMRC

Benchmarks

FinEval, FinanceIQ
