InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods

January 19, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The dataset and methods show clear task-level gains on the evaluated splits, but performance varies by model, and real-world integration (scale, privacy, and retrieval coverage) still needs testing.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Insurance products are described across short FAQs, structured product databases, and long legal clauses. A unified benchmark plus task-specific pipelines (SQL-ReAct, RAG-ReAct) cut mistakes in automated answers and speed up building customer-facing QA tools.

Who Should Care

Summary TLDR

This paper releases InsQABench, a Chinese insurance QA benchmark with three parts: Commonsense QA (basic concepts), Database QA (structured SQL-backed queries), and Clause QA (long legal/contract text). The authors fine-tune open LLMs with LoRA and introduce SQL-ReAct (iterative SQL generation + execution) and RAG-ReAct (iterative retrieval + PDF parsing). Fine-tuning plus task-specific methods gives large gains on their tests (examples: Qwen1.5 + SQL-ReAct ACC 57.41 vs 35.27; GLM4 + RAG-ReAct AVG 83.63 vs 73.64). Data and code are reported as available under "InsQABench".

Problem Statement

Off-the-shelf LLMs struggle in insurance because knowledge is split across short commonsense facts, structured product/company tables, and long legal clauses. The field lacks an evaluation that covers these three real-world QA modes and task-specific methods to query databases and long documents accurately.

Main Contribution

InsQABench dataset covering three QA modes: Insurance Commonsense QA, Insurance Database QA, and Insurance Clause QA.

SQL-ReAct: an iterative ReAct-style loop for generating, executing, and refining SQL until an answer is found.
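The generate, execute, refine loop described for SQL-ReAct can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sql_react`, `toy_generator`, the `products` table, and the `max_steps` limit are all hypothetical, and the stub generator stands in for a fine-tuned LLM.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Hypothetical type: maps (question, history) to the next SQL string to try.
SqlGenerator = Callable[[str, List[Tuple[str, str]]], str]


def sql_react(question: str, conn: sqlite3.Connection,
              generate_sql: SqlGenerator, max_steps: int = 3) -> Optional[list]:
    """Generate SQL, execute it, and feed errors or empty results back as context."""
    history: List[Tuple[str, str]] = []  # (sql, observation) pairs from earlier turns
    for _ in range(max_steps):
        sql = generate_sql(question, history)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            history.append((sql, f"error: {exc}"))  # execution feedback for the next turn
            continue
        if rows:
            return rows
        history.append((sql, "empty result"))
    return None  # no answer found within the step budget


def toy_generator(question: str, history: List[Tuple[str, str]]) -> str:
    """Stand-in for an LLM: guesses a wrong column first, then corrects it."""
    if not history:
        return "SELECT premium FROM products WHERE title = 'SafeLife'"  # wrong column
    return "SELECT premium FROM products WHERE name = 'SafeLife'"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, premium REAL)")
conn.execute("INSERT INTO products VALUES ('SafeLife', 1200.0)")
answer = sql_react("What is the premium of SafeLife?", conn, toy_generator)
print(answer)
```

In this toy run the first query fails on an unknown column, the error text is stored in `history`, and the second attempt returns a row. A real generator would read that history to decide how to repair the query.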

Key Findings

Supervised fine-tuning (LoRA) raises commonsense QA accuracy for GLM4-9B from 64.40 to 70.26.

Numbers: ACC +5.86 (64.40 → 70.26)

Practical Use: Fine-tune a general Chinese LLM on domain QAs with LoRA to get clearer, more accurate consumer-facing answers.

Evidence Ref: Table 4
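For readers unfamiliar with LoRA, the standard formulation is shown below. It is the general technique, not a detail reported in this summary: the rank $r$, the scaling factor $\alpha$, and which layers are adapted are not stated here and are left symbolic.

```latex
% Standard LoRA update: the pretrained weight W_0 is frozen; only A and B are trained.
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r(d + k) instead of dk.
```

Because only $r(d + k)$ parameters per adapted matrix are updated, fine-tuning fits on far less GPU memory than full fine-tuning would need.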

SQL-ReAct plus fine-tuning substantially improves database QA; Qwen1.5 ACC rose from 35.27 to 57.41.

Numbers: ACC +22.14 (35.27 → 57.41)

Practical Use: For product-lookup or QA over company/product tables, use iterative SQL generation + execution and fine-tune to reduce wrong/missing query results.

Evidence Ref: Table 5

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Commonsense QA (model-based ACC), GLM4 | 70.26 (fine-tuned) | 64.40 (base) | +5.86 | Commonsense QA test (100 samples for model-based eval) | Table 4 shows GLM4 ACC improved from 64.40 to 70.26 after LoRA fine-tuning. | Table 4 |
| Database QA ACC, Qwen1.5 | 57.41 (fine-tuned + SQL-ReAct) | 35.27 (two-rounds baseline) | +22.14 | Database QA test set (546 examples) | Table 5 reports Qwen1.5 ACC 35.27 baseline and 57.41 after fine-tune+SQL-ReAct. | Table 5 |
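The deltas in the table are absolute accuracy points. As a quick check, the snippet below recomputes them and also derives the relative gain over each baseline. The relative figures are derived here from the reported numbers; they are not reported in the source. The `gains` helper is hypothetical.

```python
# Reported (baseline, value) pairs from the results table above.
results = {
    "Commonsense QA, GLM4 (Table 4)": (64.40, 70.26),
    "Database QA, Qwen1.5 (Table 5)": (35.27, 57.41),
}


def gains(baseline: float, value: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent of the baseline)."""
    absolute = round(value - baseline, 2)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


for name, (baseline, value) in results.items():
    absolute, relative_pct = gains(baseline, value)
    print(f"{name}: +{absolute:.2f} points ({relative_pct:.1f}% relative)")
```

The two test sets differ in size and scoring method (100 model-judged samples versus 546 examples), so the two gains are not directly comparable to each other.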

What To Try In 7 Days

Fine-tune a base Chinese LLM with LoRA on 1–2k domain QAs to see quick accuracy gains.

Run SQL-ReAct on a small product database: generate SQL, execute, refine until correct result.

Parse 2–3 sample policy PDFs with Adobe Extract, index with BGE-M3+Faiss, and test RAG-ReAct retrieval and answer traceability.
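The iterative retrieval step in the last item can be prototyped before any PDF parsing or indexing is wired up. The sketch below is a stand-in under stated assumptions: a toy bag-of-words similarity replaces BGE-M3 and Faiss, the clause texts are invented, and `rag_react`, `is_sufficient`, and `rewrite` are hypothetical names (in practice an LLM would play the last two roles).

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple

Hit = Tuple[int, str]  # (passage index, passage text); the index keeps answers traceable


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a dense encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: List[str], k: int = 1) -> List[Hit]:
    """Return the k passages most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(enumerate(passages), key=lambda p: cosine(q, embed(p[1])), reverse=True)
    return ranked[:k]


def rag_react(question: str, passages: List[str],
              is_sufficient: Callable[[str, List[Hit]], bool],
              rewrite: Callable[[str, List[Hit]], str],
              max_steps: int = 3) -> Optional[List[Hit]]:
    """Retrieve, check whether the evidence suffices, and rewrite the query if not."""
    query = question
    for _ in range(max_steps):
        hits = retrieve(query, passages)
        if is_sufficient(question, hits):
            return hits
        query = rewrite(question, hits)
    return None


# Invented clause texts for illustration only.
passages = [
    "Clause 7: claims must be filed within 30 days of the incident.",
    "Clause 3: the waiting period is 90 days from the effective date.",
    "Clause 9: the policy excludes pre-existing conditions.",
]
hits = rag_react(
    "how long until the policy pays out",
    passages,
    is_sufficient=lambda q, h: "waiting period" in h[0][1],  # toy judge
    rewrite=lambda q, h: "waiting period days",              # toy query rewrite
)
print(hits)
```

The first retrieval surfaces the wrong clause because the question shares no key terms with the relevant one. The rewritten query then finds it, and the returned index points back to the supporting passage.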

Agent Features

Memory
multi-turn context accumulation (execution feedback stored in dialogue)
Planning
iterative reasoning and query refinement
Tool Use
SQL execution, dense retriever, PDF parser, vector DB (Faiss)
Frameworks
ReAct, RAG, LoRA
Is Agentic

Yes

Architectures
LLM with ReAct-style multi-turn loops

Optimization Features

Infra Optimization
Training on 3x NVIDIA L40s (48GB)
Model Optimization
LoRA
Training Optimization
LoRA, data augmentation via Evol-Instruct-like template evolution

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

InsQABench (paper states data and code available under project InsQABench)

Data URLs

InsQABench (paper states data and code available under project InsQABench)

Risks & Boundaries

Limitations

Training data generation and some answers were produced or refined by GPT-3.5/Gemini, which can introduce generation artifacts.

Evaluation partly relies on model-based scoring (GPT-4o), which can bias results toward models similar to the judge.

When Not To Use

For multilingual needs (dataset is Chinese-only) without additional localization.

When legal or regulatory workflows require certified human-only decisions.

Failure Modes

Hallucinated answers when retrieval misses relevant paragraphs.

SQL execution errors from entity name mismatches or abbreviation differences.
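One common mitigation for the entity-name failure mode is to map a user's mention onto the closest name stored in the database before building the query. The sketch below uses Python's standard `difflib` for that. It is an assumption about how one might handle the problem, not something the source describes; the company names, the `companies` table, and the 0.6 cutoff are hypothetical.

```python
import difflib
import sqlite3
from typing import List, Optional


def resolve_entity(mention: str, known_names: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Map a user mention to the closest stored name, or None if nothing is close enough."""
    lowered = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


# Hypothetical company table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, product_count INTEGER)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Sunrise Mutual Life Insurance", 12), ("Harbor Property Insurance", 7)],
)
names = [row[0] for row in conn.execute("SELECT name FROM companies")]

# The user's abbreviated mention would return no rows with an exact-match WHERE clause.
resolved = resolve_entity("Sunrise Mutual Life", names)
rows = conn.execute(
    "SELECT product_count FROM companies WHERE name = ?", (resolved,)
).fetchall()
print(resolved, rows)
```

String similarity will not catch abbreviations that share few characters with the full name, so an alias table may still be needed for those cases.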

Core Entities

Models

Baichuan2-13B, GLM4-9B, Qwen1.5-14B, GPT-3.5, GPT-4, Wenxin, Kimi, ChatPDF, Gemini 1.5 Pro

Metrics

ACC, PRO, SIM, Precision, F1, ROUGE-1, ROUGE-L, Completeness, Clarity, AVG

Datasets

InsQABench, Insurance Commonsense QA, Insurance Database QA, Insurance Clause QA, InsuranceQA_zh (used as commonsense test source)

Benchmarks

InsQABench

Context Entities

Models

GPT-4o (used as judge in model-based evaluation)

Datasets

25k insurance products crawled for the database; 8k user QA crawled and 2k expert-written commonsense QAs (10k train); 546 manual Database QA test examples; 870 Clause QA test examples (100 high-quality evaluated)