InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods

Overview

Decision Snapshot: Needs Validation

The dataset and methods show clear task-level gains on evaluated splits, but performance varies by model and real-world integration (scale, privacy, and retrieval coverage) still needs testing.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Insurance products are described across short FAQs, structured product databases, and long legal clauses. A unified benchmark plus task-specific pipelines (SQL-ReAct, RAG-ReAct) can reduce errors in automated answers and shorten the time needed to build customer-facing QA tools.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

This paper releases InsQABench, a Chinese insurance QA benchmark with three parts: Commonsense QA (basic concepts), Database QA (structured SQL-backed queries), and Clause QA (long legal/contract text). The authors fine-tune open LLMs with LoRA and introduce SQL-ReAct (iterative SQL generation + execution) and RAG-ReAct (iterative retrieval + PDF parsing). Fine-tuning plus task-specific methods gives large gains on their tests (examples: Qwen1.5 + SQL-ReAct ACC 57.41 vs 35.27; GLM4 + RAG-ReAct AVG 83.63 vs 73.64). Data and code are reported as available under "InsQABench".

Problem Statement

Off-the-shelf LLMs struggle in insurance because knowledge is split across short commonsense facts, structured product/company tables, and long legal clauses. The field lacks an evaluation that covers these three real-world QA modes and task-specific methods to query databases and long documents accurately.

Main Contribution

InsQABench dataset covering three QA modes: Insurance Commonsense QA, Insurance Database QA, and Insurance Clause QA.

SQL-ReAct: an iterative ReAct-style loop for generating, executing, and refining SQL until an answer is found.
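The SQL-ReAct loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_sql` is a hypothetical stand-in for the LLM call, the `products` table and its columns are invented for the demo, and stopping at the first non-empty result is an assumed stopping rule.

```python
import sqlite3

def propose_sql(question, history):
    """Hypothetical stand-in for the LLM: returns the next SQL attempt.

    A real system would prompt a model with the question plus `history`
    (earlier queries and their execution feedback).
    """
    if not history:
        # First attempt deliberately uses a wrong column name.
        return "SELECT premium FROM products WHERE title = 'SafeLife'"
    # After seeing the error feedback, the stub switches to the right column.
    return "SELECT premium FROM products WHERE name = 'SafeLife'"

def sql_react(conn, question, max_steps=4):
    history = []  # (sql, observation) pairs fed back on each round
    for _ in range(max_steps):
        sql = propose_sql(question, history)
        try:
            rows = conn.execute(sql).fetchall()
            observation = rows if rows else "empty result"
        except sqlite3.Error as exc:
            rows, observation = [], f"error: {exc}"
        history.append((sql, observation))
        if rows:  # assumed stopping rule: first non-empty result wins
            return rows, history
    return None, history

# Toy database (schema and row are invented for this example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, premium REAL)")
conn.execute("INSERT INTO products VALUES ('SafeLife', 1200.0)")
rows, history = sql_react(conn, "What is the premium of SafeLife?")
```

Keeping every query and its observation in `history` mirrors the idea of storing execution feedback in the dialogue, so the next attempt can react to the previous error.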

Key Findings

Supervised fine-tuning (LoRA) raises commonsense QA accuracy for GLM4-9B from 64.40 to 70.26.

Numbers: ACC +5.86 (64.40 → 70.26)

Practical Use: Fine-tune a general Chinese LLM on domain QAs with LoRA to get clearer, more accurate consumer-facing answers.

Evidence Ref: Table 4

SQL-ReAct plus fine-tuning substantially improves database QA; Qwen1.5 ACC rose from 35.27 to 57.41.

Numbers: ACC +22.14 (35.27 → 57.41)

Practical Use: For product-lookup or QA over company/product tables, use iterative SQL generation + execution and fine-tune to reduce wrong/missing query results.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Commonsense QA (model-based ACC) GLM4	70.26 (fine-tuned)	64.40 (base)	+5.86	Commonsense QA test (100 samples for model-based eval)	Table 4 shows GLM4 ACC improved from 64.40 to 70.26 after LoRA fine-tuning.	Table 4
Database QA ACC Qwen1.5	57.41 (fine-tuned + SQL-ReAct)	35.27 (two-rounds baseline)	+22.14	Database QA test set (546 examples)	Table 5 reports Qwen1.5 ACC 35.27 baseline and 57.41 after fine-tune+SQL-ReAct.	Table 5
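The deltas in the table are simple differences between the reported value and the baseline. A short sketch like the one below can be used to recheck them and to express the gains in relative terms; the relative figures are derived here for illustration and are not reported in the source.

```python
# (label, baseline, value) taken from the results table above
results = [
    ("Commonsense QA, GLM4 (ACC)", 64.40, 70.26),
    ("Database QA, Qwen1.5 (ACC)", 35.27, 57.41),
]

deltas = {}
for name, baseline, value in results:
    absolute = round(value - baseline, 2)                     # percentage points
    relative = round(100 * (value - baseline) / baseline, 1)  # percent of baseline
    deltas[name] = (absolute, relative)

for name, (absolute, relative) in deltas.items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```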

What To Try In 7 Days

Fine-tune a base Chinese LLM with LoRA on 1–2k domain QAs to see quick accuracy gains.

Run SQL-ReAct on a small product database: generate SQL, execute, refine until correct result.

Parse 2–3 sample policy PDFs with Adobe Extract, index with BGE-M3+Faiss, and test RAG-ReAct retrieval and answer traceability.
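For the retrieval step above, the control flow can be prototyped before any real encoder or index is wired in. The sketch below is an assumption-heavy toy, not the paper's pipeline: PDF parsing is skipped and the clause chunks are hand-written, `embed` is a bag-of-words stand-in for a dense encoder such as BGE-M3, the brute-force search stands in for a Faiss index, `rewrite` is a hypothetical placeholder for the LLM's query reformulation, and the `min_score` threshold is arbitrary.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Brute-force top-k search; a vector index would replace this."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def rag_react(question, chunks, rewrite, max_steps=3, min_score=0.3):
    """Retrieve, check the evidence, and rewrite the query if it looks too weak."""
    query, trace = question, []
    for _ in range(max_steps):
        top = retrieve(query, chunks)[0]
        score = cosine(embed(query), embed(top["text"]))
        trace.append((query, top["id"], round(score, 2)))
        if score >= min_score:
            return top, trace  # the chunk id gives answer traceability
        query = rewrite(query)
    return None, trace

# Hand-written clause chunks (invented for this example).
chunks = [
    {"id": "clause-3.1", "text": "the waiting period for critical illness is 90 days"},
    {"id": "clause-5.2", "text": "the insurer may refuse claims caused by intentional injury"},
]

def rewrite(query):
    """Hypothetical reformulation; a real system would ask the LLM."""
    return "waiting period critical illness"

answer, trace = rag_react("how long before coverage starts", chunks, rewrite)
```

Recording the query, the retrieved chunk id, and the score at every step makes it easy to see when retrieval missed the relevant paragraph, which is the failure case to watch for.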

Agent Features

Memory

multi-turn context accumulation (execution feedback stored in dialogue)

Planning

iterative reasoning and query refinement

Tool Use

SQL execution, dense retriever, PDF parser, vector DB (Faiss)

Frameworks

ReAct, RAG, LoRA

Is Agentic

Yes

Architectures

LLM with ReAct-style multi-turn loops

Optimization Features

Infra Optimization

Training on 3x NVIDIA L40s (48GB)

Model Optimization

LoRA
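LoRA keeps the pretrained weight matrix frozen and trains two small low-rank factors per adapted layer, which is why it fits on a few GPUs. The sketch below only illustrates the parameter arithmetic; the layer size and rank are illustrative assumptions and are not taken from the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def full_params(d_out, d_in):
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in

d_out = d_in = 4096  # illustrative layer size, not from the paper
rank = 8             # illustrative LoRA rank, not from the paper

lora = lora_trainable_params(d_out, d_in, rank)
full = full_params(d_out, d_in)
ratio = lora / full
print(f"{lora} trainable vs {full} frozen ({ratio:.2%})")
```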

Training Optimization

LoRA, data augmentation via Evol-Instruct-like template evolution

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

InsQABench (paper states data and code available under project InsQABench)

Data URLs

InsQABench (paper states data and code available under project InsQABench)

Risks & Boundaries

Limitations

Some training data and answers were generated or refined by GPT-3.5/Gemini, which can introduce generation artifacts.

Evaluation partly relies on model-based scoring (GPT-4o), which can bias results toward models similar to the judge.

When Not To Use

For multilingual needs (dataset is Chinese-only) without additional localization.

When legal or regulatory workflows require certified human-only decisions.

Failure Modes

Hallucinated answers when retrieval misses relevant paragraphs.

SQL execution errors from entity name mismatches or abbreviation differences.
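One possible mitigation for the entity-name mismatch failure is to normalize a user's mention against the names stored in the database before generating SQL. The sketch below uses Python's standard `difflib` for fuzzy matching; it is not described in the source, the company names are invented, and the `cutoff` value is an arbitrary assumption that would need tuning (Chinese abbreviations in particular may need an alias table instead).

```python
import difflib

def resolve_entity(mention, known_names, cutoff=0.6):
    """Map a user mention to the closest stored name, or None if nothing is close."""
    matches = difflib.get_close_matches(mention, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Invented names standing in for the company/product table.
known = [
    "Evergreen Life Insurance Co.",
    "Harbor Mutual Insurance",
    "Summit Pension Fund",
]

resolved = resolve_entity("Evergreen Life", known)
unresolved = resolve_entity("zzz", known)
```

Returning `None` for a weak match lets the caller ask a clarifying question instead of running a query that silently returns nothing.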

Core Entities

Models

Baichuan2-13B, GLM4-9B, Qwen1.5-14B, GPT-3.5, GPT-4, Wenxin, Kimi, ChatPDF, Gemini 1.5 Pro

Metrics

ACC, PRO, SIM, Precision, F1, ROUGE-1, ROUGE-L, Completeness, Clarity, AVG

Datasets

InsQABench, Insurance Commonsense QA, Insurance Database QA, Insurance Clause QA, InsuranceQA_zh (used as commonsense test source)

Benchmarks

InsQABench

Context Entities

Models

GPT-4o (used as judge in model-based evaluation)

Datasets

25k insurance products crawled for the database; 8k user QA crawled and 2k expert-written commonsense QAs (10k train); 546 manual Database QA test examples; 870 Clause QA test examples (100 high-quality evaluated)
