InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods

January 19, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen

Links

Abstract / PDF

Why It Matters For Business

Insurance product information is spread across short FAQs, structured product databases, and long legal clauses. A unified benchmark plus task-specific pipelines (SQL-ReAct, RAG-ReAct) can reduce errors in automated answers and shorten the time needed to build customer-facing QA tools.

Summary TLDR

This paper releases InsQABench, a Chinese insurance QA benchmark with three parts: Commonsense QA (basic concepts), Database QA (structured SQL-backed queries), and Clause QA (long legal/contract text). The authors fine-tune open LLMs with LoRA and introduce SQL-ReAct (iterative SQL generation + execution) and RAG-ReAct (iterative retrieval + PDF parsing). Fine-tuning combined with these task-specific methods gives large gains on the authors' tests. Examples: Database QA ACC for Qwen1.5 rises from 35.27 to 57.41 with SQL-ReAct, and Clause QA AVG for GLM4 rises from 73.64 to 83.63 with RAG-ReAct. Data and code are reported as available under "InsQABench".

Problem Statement

Off-the-shelf LLMs struggle in insurance because knowledge is split across short commonsense facts, structured product/company tables, and long legal clauses. The field lacks both an evaluation that covers these three real-world QA modes and task-specific methods for querying databases and long documents accurately.

Main Contribution

InsQABench dataset covering three QA modes: Insurance Commonsense QA, Insurance Database QA, and Insurance Clause QA.

SQL-ReAct: an iterative ReAct-style loop for generating, executing, and refining SQL until an answer is found.

RAG-ReAct: a retrieval-augmented pipeline that pairs rule-enhanced PDF parsing with iterative retrieval and reasoning.

Demonstration that supervised fine-tuning (LoRA) plus the two methods significantly improves open models and can beat some closed models on these tasks.

Released demos and stated availability of data and code under the InsQABench project.
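The SQL-ReAct loop described above (generate SQL, execute it, refine on feedback) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sql_react`, `propose_sql`, and `scripted_model` are hypothetical names, the model call is replaced by a scripted stand-in, and an in-memory SQLite table with made-up rows stands in for the product database.

```python
import sqlite3


def sql_react(question, propose_sql, conn, max_turns=3):
    """Generate SQL, execute it, and feed errors or empty results back as context."""
    history = [("question", question)]  # dialogue memory, including execution feedback
    for _ in range(max_turns):
        sql = propose_sql(history)  # hypothetical LLM call that sees the full history
        history.append(("sql", sql))
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The error text becomes context for the next attempt.
            history.append(("error", str(exc)))
            continue
        if rows:
            return rows, history
        history.append(("observation", "query returned no rows"))
    return None, history


def scripted_model(history):
    """Stand-in for an LLM: tries a wrong column name first, then corrects it."""
    saw_error = any(kind == "error" for kind, _ in history)
    if saw_error:
        return "SELECT name FROM products WHERE company = 'Acme Life'"
    return "SELECT product_name FROM products WHERE company = 'Acme Life'"


# Toy database (assumed schema, illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, company TEXT)")
conn.execute("INSERT INTO products VALUES ('Term Plan A', 'Acme Life')")

rows, history = sql_react("Which products does Acme Life sell?", scripted_model, conn)
print(rows)
print([kind for kind, _ in history])
```

Keeping the failed query and its error message in `history` is what lets the next generation step correct itself, which mirrors the "execution feedback stored in dialogue" memory listed under Agent Features.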

Key Findings

Supervised fine-tuning (LoRA) raises commonsense QA accuracy for GLM4-9B from 64.40 to 70.26.

Numbers: ACC +5.86 (64.40 → 70.26)

SQL-ReAct plus fine-tuning substantially improves database QA; Qwen1.5 ACC rose from 35.27 to 57.41.

Numbers: ACC +22.14 (35.27 → 57.41)

RAG-ReAct improves clause QA over standard RAG; GLM4 (fine-tuned) AVG rose from 73.64 to 83.63.

Numbers: AVG +9.99 (73.64 → 83.63)
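The reported gains can be re-derived from the baseline and final scores. The short check below recomputes the absolute deltas quoted above and also prints a relative gain; the relative figures are derived here for convenience and are not quoted from the paper. The helper name `gain` is an assumption for illustration.

```python
def gain(baseline, value):
    """Return (absolute gain in points, relative gain in percent of the baseline)."""
    delta = value - baseline
    return round(delta, 2), round(100 * delta / baseline, 1)


# (baseline, final) pairs as listed in the findings above.
findings = {
    "Commonsense QA ACC, GLM4-9B": (64.40, 70.26),
    "Database QA ACC, Qwen1.5": (35.27, 57.41),
    "Clause QA AVG, GLM4": (73.64, 83.63),
}

for name, (baseline, value) in findings.items():
    delta, relative = gain(baseline, value)
    print(f"{name}: +{delta} points ({relative}% over baseline)")
```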

Results

Commonsense QA (model-based ACC), GLM4

Value: 70.26 (fine-tuned)

Baseline: 64.40 (base)

Database QA ACC, Qwen1.5

Value: 57.41 (fine-tuned + SQL-ReAct)

Baseline: 35.27 (two-rounds baseline)

Database QA ACC, Baichuan2

Value: 52.50 (fine-tuned + SQL-ReAct)

Baseline: 4.89 (two-rounds baseline)

Clause QA AVG, GLM4

Value: 83.63 (fine-tuned + RAG-ReAct)

Baseline: 73.64 (fine-tuned + RAG)

Clause QA AVG, Qwen1.5

Value: 73.06 (fine-tuned + RAG-ReAct)

Baseline: 72.83 (fine-tuned + RAG)

Who Should Care

What To Try In 7 Days

Fine-tune a base Chinese LLM with LoRA on 1–2k domain QAs to see quick accuracy gains.

Run SQL-ReAct on a small product database: generate SQL, execute it, and refine the query until the result is correct.

Parse 2–3 sample policy PDFs with Adobe Extract, index with BGE-M3+Faiss, and test RAG-ReAct retrieval and answer traceability.
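For the retrieval experiment, the iterative part of RAG-ReAct can be prototyped before any real index exists. The sketch below is an assumption-heavy toy: a bag-of-words cosine score stands in for BGE-M3 embeddings and Faiss, the paragraphs are invented, and `refine_query` and `is_sufficient` are hypothetical stand-ins for LLM calls. It only illustrates the retrieve, check, refine loop and how paragraph ids can be kept for answer traceability.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (stand-in for a dense encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, paragraphs, k=1):
    """Return the k paragraphs most similar to the query."""
    q = embed(query)
    ranked = sorted(paragraphs, key=lambda p: cosine(q, embed(p["text"])), reverse=True)
    return ranked[:k]


def rag_react(question, paragraphs, refine_query, is_sufficient, max_turns=3):
    """Retrieve, check whether the evidence is enough, and refine the query if not."""
    query, evidence = question, []
    for _ in range(max_turns):
        for hit in retrieve(query, paragraphs):
            if hit not in evidence:
                evidence.append(hit)  # keep ids so the answer can cite its sources
        if is_sufficient(question, evidence):
            break
        query = refine_query(question, evidence)
    return evidence


# Invented clause paragraphs, as if produced by a PDF parser.
paragraphs = [
    {"id": "p1", "text": "The waiting period for critical illness is 90 days"},
    {"id": "p2", "text": "Claims during the waiting period are refunded as paid premiums"},
    {"id": "p3", "text": "The policy term is 20 years"},
]

evidence = rag_react(
    "What happens to a claim in the waiting period",
    paragraphs,
    refine_query=lambda q, ev: "claims refunded premiums",  # scripted stand-in for an LLM
    is_sufficient=lambda q, ev: any("refunded" in p["text"] for p in ev),
)
print([p["id"] for p in evidence])
```

Swapping `embed` and `retrieve` for a real encoder and vector index should leave `rag_react` unchanged, which makes it easy to compare single-shot retrieval (`max_turns=1`) against the iterative loop on the same questions.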

Agent Features

Memory

  • multi-turn context accumulation (execution feedback stored in dialogue)

Planning

  • iterative reasoning and query refinement

Tool Use

  • SQL execution
  • dense retriever
  • PDF parser
  • vector DB (Faiss)

Frameworks

  • ReAct
  • RAG
  • LoRA

Is Agentic

true

Architectures

  • LLM with ReAct-style multi-turn loops

Optimization Features

Infra Optimization

  • Training on 3x NVIDIA L40s (48GB)

Model Optimization

  • LoRA

Training Optimization

  • LoRA
  • data augmentation via Evol-Instruct-like template evolution

Reproducibility

Code Urls

  • InsQABench (paper states data and code available under project InsQABench)

Data Urls

  • InsQABench (paper states data and code available under project InsQABench)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Some training data and answers were generated or refined by GPT-3.5/Gemini, which can introduce generation artifacts.
  • Evaluation partly relies on model-based scoring (GPT-4o), which can bias results toward models similar to the judge.
  • Dataset is Chinese-only and focuses on insurance; results may not transfer to other languages or domains.
  • Clause QA relies on heuristic PDF parsing rules that may not generalize to radically different layouts.

When Not To Use

  • For multilingual needs (dataset is Chinese-only) without additional localization.
  • When legal or regulatory workflows require certified human-only decisions.
  • For documents with heavy non-textual elements (images or scanned handwriting) where Adobe Extract rules may fail.

Failure Modes

  • Hallucinated answers when retrieval misses relevant paragraphs.
  • SQL execution errors from entity name mismatches or abbreviation differences.
  • Overconfidence in short training-generated answers leading to missing nuance in legal clauses.
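One possible mitigation for the entity-name mismatch failure mode is to normalize a mentioned name against the names stored in the database before the SQL is generated. The sketch below uses Python's standard `difflib` for this; it is not a technique taken from the paper, and the company names, the `resolve_entity` helper, and the 0.6 cutoff are all illustrative assumptions.

```python
import difflib


def resolve_entity(mention, known_names, cutoff=0.6):
    """Map a user-supplied name to the closest stored name, or None if nothing is close."""
    matches = difflib.get_close_matches(mention, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical company names as they might appear in a product table.
known_names = ["Acme Life Insurance", "Harbor Mutual Insurance", "Summit Health Assurance"]

print(resolve_entity("Acme Life Ins", known_names))  # abbreviated mention
print(resolve_entity("Zenith", known_names))         # no close match in the table
```

Returning `None` instead of a weak guess lets the calling loop ask for clarification or fall back to a `LIKE` query, rather than silently filtering on the wrong entity.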

Core Entities

Models

  • Baichuan2-13B
  • GLM4-9B
  • Qwen1.5-14B
  • GPT-3.5
  • GPT-4
  • Wenxin
  • Kimi
  • ChatPDF
  • Gemini 1.5 Pro

Metrics

  • ACC
  • PRO
  • SIM
  • Precision
  • F1
  • ROUGE-1
  • ROUGE-L
  • Completeness
  • Clarity
  • AVG

Datasets

  • InsQABench
  • Insurance Commonsense QA
  • Insurance Database QA
  • Insurance Clause QA
  • InsuranceQA_zh (used as commonsense test source)

Benchmarks

  • InsQABench

Context Entities

Models

  • GPT-4o (used as judge in model-based evaluation)

Datasets

  • 25k insurance products crawled for the database
  • 8k crawled user QAs and 2k expert-written commonsense QAs (10k training examples)
  • 546 manual Database QA test examples
  • 870 Clause QA test examples (100 high-quality evaluated)