Overview
The dataset and methods show clear task-level gains on the evaluated splits, but performance varies by model, and real-world integration (scale, privacy, and retrieval coverage) still needs testing.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Insurance products are described across short FAQs, structured product databases, and long legal clauses. A unified benchmark plus task-specific pipelines (SQL-ReAct, RAG-ReAct) cut mistakes in automated answers and speed up building customer-facing QA tools.
Summary TLDR
This paper releases InsQABench, a Chinese insurance QA benchmark with three parts: Commonsense QA (basic concepts), Database QA (structured SQL-backed queries), and Clause QA (long legal/contract text). The authors fine-tune open LLMs with LoRA and introduce SQL-ReAct (iterative SQL generation + execution) and RAG-ReAct (iterative retrieval + PDF parsing). Fine-tuning plus task-specific methods gives large gains on their tests (examples: Qwen1.5 + SQL-ReAct ACC 57.41 vs 35.27; GLM4 + RAG-ReAct AVG 83.63 vs 73.64). Data and code are reported as available under "InsQABench".
Problem Statement
Off-the-shelf LLMs struggle in insurance because knowledge is split across short commonsense facts, structured product/company tables, and long legal clauses. The field lacks both a benchmark that covers these three real-world QA modes and task-specific methods for querying databases and long documents accurately.
Main Contribution
InsQABench dataset covering three QA modes: Insurance Commonsense QA, Insurance Database QA, and Insurance Clause QA.
SQL-ReAct: an iterative ReAct-style loop for generating, executing, and refining SQL until an answer is found.
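The generate, execute, refine loop described for SQL-ReAct can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `propose_sql` is a hypothetical stand-in for the LLM call (its attempts are scripted so the loop runs offline), and the `products` table and its contents are invented for the example.

```python
import sqlite3

def propose_sql(question, history):
    """Hypothetical stand-in for the LLM: returns the next SQL attempt.

    A real system would prompt a model with the question, the schema, and the
    observations in `history`; here the attempts are scripted.
    """
    attempts = [
        "SELECT premium FROM product WHERE name = 'SafeLife'",       # wrong table name
        "SELECT premium FROM products WHERE name = 'SafeLife'",      # entity name mismatch
        "SELECT premium FROM products WHERE name LIKE 'Safe Life%'", # refined query
    ]
    return attempts[min(len(history), len(attempts) - 1)]

def sql_react(conn, question, max_steps=5):
    """Generate SQL, execute it, and refine until rows come back or steps run out."""
    history = []
    for _ in range(max_steps):
        sql = propose_sql(question, history)
        try:
            rows = conn.execute(sql).fetchall()
            observation = rows if rows else "empty result"
        except sqlite3.Error as exc:
            rows, observation = [], f"error: {exc}"
        # The observation is what the next generation step would condition on.
        history.append((sql, observation))
        if rows:
            return rows, history
    return None, history

# Toy in-memory database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, premium REAL)")
conn.execute("INSERT INTO products VALUES ('Safe Life Plus', 1200.0)")

answer, trace = sql_react(conn, "What is the premium of Safe Life?")
for sql, observation in trace:
    print(sql, "->", observation)
```

Capping the loop with `max_steps` keeps a model that never converges from looping indefinitely; the returned trace also makes each failed attempt inspectable.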
Key Findings
Supervised fine-tuning (LoRA) raises commonsense QA accuracy for GLM4-9B from 64.40 to 70.26.
SQL-ReAct plus fine-tuning substantially improves database QA; Qwen1.5 ACC rose from 35.27 to 57.41.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Commonsense QA (model-based ACC) GLM4 | 70.26 (fine-tuned) | 64.40 (base) | +5.86 | Commonsense QA test (100 samples for model-based eval) | Table 4 shows GLM4 ACC improved from 64.40 to 70.26 after LoRA fine-tuning. | Table 4 |
| Database QA ACC Qwen1.5 | 57.41 (fine-tuned + SQL-ReAct) | 35.27 (two-rounds baseline) | +22.14 | Database QA test set (546 examples) | Table 5 reports Qwen1.5 ACC 35.27 baseline and 57.41 after fine-tune+SQL-ReAct. | Table 5 |
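The Delta column is the fine-tuned score minus the baseline. A short sketch of that arithmetic, using only the values in the table, also gives the relative gain over each baseline; the `gains` helper is a name introduced here for illustration.

```python
# (baseline, value) pairs copied from the Results table.
results = {
    "GLM4 commonsense ACC": (64.40, 70.26),
    "Qwen1.5 database ACC": (35.27, 57.41),
}

def gains(baseline, value):
    """Return (absolute delta in points, relative gain in percent of baseline)."""
    delta = round(value - baseline, 2)
    relative = round(100 * (value - baseline) / baseline, 1)
    return delta, relative

summary = {name: gains(b, v) for name, (b, v) in results.items()}
for name, (delta, relative) in summary.items():
    print(f"{name}: +{delta} points ({relative}% over baseline)")
```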
What To Try In 7 Days
Fine-tune a base Chinese LLM with LoRA on 1–2k domain QAs to see quick accuracy gains.
Run SQL-ReAct on a small product database: generate SQL, execute it, and refine until the result is correct.
Parse 2–3 sample policy PDFs with Adobe Extract, index with BGE-M3+Faiss, and test RAG-ReAct retrieval and answer traceability.
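To prototype the retrieval and traceability check before wiring up real components, a dependency-free sketch like the one below may help. It is a stand-in, not the pipeline named above: word-count vectors and a linear scan replace BGE-M3 embeddings and a Faiss index, and the paragraphs, file names, and page numbers are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a BGE-M3 vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Parsed paragraphs keep their source document and page so answers stay traceable.
paragraphs = [
    {"doc": "policy_a.pdf", "page": 3,
     "text": "The waiting period for critical illness coverage is 90 days."},
    {"doc": "policy_a.pdf", "page": 7,
     "text": "Premiums are payable annually on the policy anniversary."},
    {"doc": "policy_b.pdf", "page": 2,
     "text": "Claims must be filed within 30 days of the incident."},
]
index = [(embed(p["text"]), p) for p in paragraphs]  # linear scan instead of Faiss

def retrieve(query, k=2):
    """Return the k paragraphs most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [p for _, p in ranked[:k]]

hits = retrieve("How long is the waiting period for critical illness?")
for h in hits:
    print(f'{h["doc"]} p.{h["page"]}: {h["text"]}')
```

Storing the document name and page alongside each paragraph is what lets a generated answer cite where its evidence came from.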
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Some training data and answers were generated or refined by GPT-3.5/Gemini, which can introduce generation artifacts.
Evaluation partly relies on model-based scoring (GPT-4o), which can bias results toward models similar to the judge.
When Not To Use
For multilingual deployments without additional localization (the dataset is Chinese-only).
When legal or regulatory workflows require certified human-only decisions.
Failure Modes
Hallucinated answers when retrieval misses relevant paragraphs.
SQL execution errors from entity name mismatches or abbreviation differences.
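One possible mitigation for the entity-mismatch failure mode is to resolve a user's mention to a canonical database name before it is placed into SQL. The sketch below uses fuzzy string matching from Python's standard library; it is an assumption about how one might address the issue, not a technique reported in the paper, and the company names are invented.

```python
import difflib

# Hypothetical canonical names as they might appear in a company table.
CANONICAL = [
    "Sunrise Life Insurance",
    "Harbor Property Insurance",
    "Evergreen Health Insurance",
]

def resolve_entity(mention, names=CANONICAL, cutoff=0.6):
    """Map a free-text mention to the closest canonical name, or None."""
    lowered = {n.lower(): n for n in names}  # case-insensitive comparison
    match = difflib.get_close_matches(mention.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else None

resolved = resolve_entity("sunrise life")
abbreviated = resolve_entity("Evergreen Health Ins.")
unknown = resolve_entity("zzz")
print(resolved, abbreviated, unknown, sep=" | ")
```

Returning `None` for an unmatched mention lets the caller ask for clarification instead of running a query that silently returns nothing. The `cutoff` value is a tunable assumption and would need calibrating on real entity names.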

