Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Insurance knowledge is spread across short FAQs, structured product databases, and long legal clauses. A unified benchmark combined with task-specific pipelines (SQL-ReAct, RAG-ReAct) can reduce errors in automated answers and speed up the development of customer-facing QA tools.
Summary TLDR
This paper releases InsQABench, a Chinese insurance QA benchmark with three parts: Commonsense QA (basic concepts), Database QA (structured, SQL-backed queries), and Clause QA (long legal and contract text). The authors fine-tune open LLMs with LoRA and introduce SQL-ReAct (iterative SQL generation and execution) and RAG-ReAct (iterative retrieval combined with PDF parsing). On the authors' tests, fine-tuning plus the task-specific methods yields large gains: Database QA ACC for Qwen1.5 rises from 35.27 to 57.41 with SQL-ReAct, and Clause QA AVG for GLM4 rises from 73.64 to 83.63 with RAG-ReAct. Data and code are reported as available under "InsQABench".
Problem Statement
Off-the-shelf LLMs struggle in insurance because the relevant knowledge is split across short commonsense facts, structured product and company tables, and long legal clauses. The field lacks both a benchmark that covers these three real-world QA modes and task-specific methods for querying databases and long documents accurately.
Main Contribution
InsQABench dataset covering three QA modes: Insurance Commonsense QA, Insurance Database QA, and Insurance Clause QA.
SQL-ReAct: an iterative ReAct-style loop for generating, executing, and refining SQL until an answer is found.
RAG-ReAct: a retrieval-augmented pipeline that pairs rule-enhanced PDF parsing with iterative retrieval and reasoning.
Demonstration that supervised fine-tuning (LoRA) plus the two methods significantly improves open models and can beat some closed models on these tasks.
Released demos and stated availability of data and code under the InsQABench project.
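The SQL-ReAct loop described above (generate SQL, execute it, refine from the feedback) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sql_react`, `scripted_generator`, the `products` table, and the stopping rule (stop at the first non-empty result) are all assumptions made for the example, and the scripted generator stands in for an LLM call.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: given the question and the dialogue so far,
# return the next SQL statement to try. In practice this would be an LLM call.
SqlGenerator = Callable[[str, List[str]], str]


def sql_react(question: str, conn: sqlite3.Connection,
              generate_sql: SqlGenerator,
              max_turns: int = 3) -> Tuple[Optional[list], List[str]]:
    """Generate SQL, execute it, and feed errors or empty results back
    into the dialogue until a query returns rows or turns run out."""
    history: List[str] = []
    for _ in range(max_turns):
        sql = generate_sql(question, history)
        history.append(f"Action: {sql}")
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Execution feedback is stored so the next turn can refine the query.
            history.append(f"Observation: error: {exc}")
            continue
        if rows:
            history.append(f"Observation: {rows}")
            return rows, history
        history.append("Observation: empty result")
    return None, history


def scripted_generator(question: str, history: List[str]) -> str:
    """Stand-in for the model: a wrong column name first, then a fix."""
    if not history:
        return "SELECT premium FROM products WHERE product_name = 'Plan A'"
    return "SELECT premium FROM products WHERE name = 'Plan A'"


# Toy in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, premium REAL)")
conn.execute("INSERT INTO products VALUES ('Plan A', 120.0)")

rows, history = sql_react("What is the premium of Plan A?", conn, scripted_generator)
```

In this toy run the first query fails on an unknown column, the error is appended to `history`, and the second query succeeds. Keeping the observations in the dialogue is what lets a real model correct itself on the next turn.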
Key Findings
Supervised fine-tuning (LoRA) raises commonsense QA accuracy for GLM4-9B from 64.40 to 70.26.
SQL-ReAct plus fine-tuning substantially improves database QA; Qwen1.5 ACC rose from 35.27 to 57.41.
RAG-ReAct improves clause QA over standard RAG; GLM4 (fine-tuned) AVG rose from 73.64 to 83.63.
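To put the three reported improvements on a common footing, the short sketch below computes the absolute gain (in points) and the relative gain (in percent of the baseline) from the numbers quoted above. The helper name `gains` and the dictionary labels are illustrative only.

```python
# Before/after scores as quoted in the findings above.
results = {
    "Commonsense QA ACC, GLM4-9B": (64.40, 70.26),
    "Database QA ACC, Qwen1.5": (35.27, 57.41),
    "Clause QA AVG, GLM4": (73.64, 83.63),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    absolute = round(after - before, 2)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for label, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{label}: +{absolute} points ({relative}% relative)")
```

The Database QA gain is the largest on both measures, which matches the summary's emphasis on SQL-ReAct.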
Results
Commonsense QA (model-based ACC) GLM4: 64.40 → 70.26
Database QA ACC Qwen1.5: 35.27 → 57.41
Database QA ACC Baichuan2
Clause QA AVG GLM4: 73.64 → 83.63
Clause QA AVG Qwen1.5
Who Should Care
What To Try In 7 Days
Fine-tune a base Chinese LLM with LoRA on 1–2k domain QAs to see quick accuracy gains.
Run SQL-ReAct on a small product database: generate SQL, execute it, and refine the query until the result is correct.
Parse 2–3 sample policy PDFs with Adobe Extract, index with BGE-M3+Faiss, and test RAG-ReAct retrieval and answer traceability.
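Before wiring up a real parser and index, the iterative retrieve-then-reason loop in the last step can be prototyped on a few hand-written paragraphs. The sketch below is only a toy under stated assumptions: a bag-of-words cosine similarity stands in for BGE-M3 embeddings and a Faiss index, `decide` is a scripted stand-in for the LLM, and the clause texts and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a dense encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, paragraphs: Dict[str, str], k: int = 1) -> List[Tuple[str, str]]:
    """Return the k (paragraph_id, text) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(paragraphs.items(),
                    key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:k]


def rag_react(question: str, paragraphs: Dict[str, str],
              decide: Callable[[str, list], Tuple[str, str]],
              max_turns: int = 3) -> Tuple[Optional[str], List[str]]:
    """Alternate retrieval and reasoning. `decide` returns ("answer", text)
    or ("search", new_query); source ids are kept for traceability."""
    evidence: List[Tuple[str, str]] = []
    query = question
    for _ in range(max_turns):
        for hit in retrieve(query, paragraphs):
            if hit not in evidence:
                evidence.append(hit)
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, [pid for pid, _ in evidence]
        query = payload  # refined query for the next retrieval round
    return None, [pid for pid, _ in evidence]


def scripted_decide(question: str, evidence: list) -> Tuple[str, str]:
    """Stand-in for the model: follow the cross-reference if needed."""
    if any("extreme sports" in text for _, text in evidence):
        return "answer", "Injuries caused by extreme sports are excluded (section 5)."
    return "search", "section 5 excludes"


# Hypothetical parsed clauses, keyed by paragraph id.
clauses = {
    "c1": "the waiting period for hospitalization benefits is 90 days",
    "c2": "exclusions are listed in section 5 of this policy",
    "c3": "section 5 excludes injuries caused by extreme sports",
}

answer, sources = rag_react("which injuries are exclusions under this policy",
                            clauses, scripted_decide)
```

Here the first retrieval only finds the clause that points to section 5, so the loop issues a refined query and picks up the clause with the actual exclusion. The returned `sources` list is what makes the answer traceable back to specific paragraphs.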
Agent Features
Memory
- multi-turn context accumulation (execution feedback stored in dialogue)
Planning
- iterative reasoning and query refinement
Tool Use
- SQL execution
- dense retriever
- PDF parser
- vector DB (Faiss)
Frameworks
- ReAct
- RAG
- LoRA
Is Agentic
true
Architectures
- LLM with ReAct-style multi-turn loops
Optimization Features
Infra Optimization
- Training on 3x NVIDIA L40s (48GB)
Model Optimization
- LoRA
Training Optimization
- LoRA
- data augmentation via Evol-Instruct-like template evolution
Reproducibility
Code Urls
- InsQABench (paper states data and code available under project InsQABench)
Data Urls
- InsQABench (paper states data and code available under project InsQABench)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training data generation and some answers were produced or refined by GPT-3.5/Gemini, which can introduce generation artifacts.
- Evaluation partly relies on model-based scoring (GPT-4o), which can bias results toward models similar to the judge.
- Dataset is Chinese-only and focuses on insurance; results may not transfer to other languages or domains.
- Clause QA relies on heuristic PDF parsing rules that may not generalize to radically different layouts.
When Not To Use
- For multilingual deployments without additional localization (the dataset is Chinese-only).
- When legal or regulatory workflows require certified human-only decisions.
- For documents with heavy non-textual elements (images or scanned handwriting) where Adobe Extract rules may fail.
Failure Modes
- Hallucinated answers when retrieval misses relevant paragraphs.
- SQL execution errors from entity name mismatches or abbreviation differences.
- Overconfident short answers, learned from generated training data, that miss nuance in legal clauses.
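The entity-name failure mode is easy to reproduce on a toy table. The sketch below shows an exact-match query returning nothing for an abbreviated company name, followed by one possible mitigation: resolving the mention against known names with Python's standard `difflib` before building the query. The mitigation is an assumption for illustration, not something the summary attributes to the paper, and the table, company names, and `resolve_entity` helper are hypothetical.

```python
import difflib
import sqlite3
from typing import List, Optional


def resolve_entity(mention: str, known_names: List[str],
                   cutoff: float = 0.6) -> Optional[str]:
    """Map a user-supplied name to the closest stored name, if any is close enough."""
    matches = difflib.get_close_matches(mention, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy in-memory table with hypothetical company names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, founded INTEGER)")
conn.execute("INSERT INTO companies VALUES ('Acme Life Insurance Co., Ltd.', 1996)")
conn.execute("INSERT INTO companies VALUES ('Borealis Property Insurance', 2003)")

mention = "Acme Life Insurance"  # abbreviated form a user might type

# Exact match on the abbreviated name finds nothing.
exact = conn.execute(
    "SELECT founded FROM companies WHERE name = ?", (mention,)).fetchall()

# Resolve the mention to a stored name first, then query again.
known = [row[0] for row in conn.execute("SELECT name FROM companies")]
canonical = resolve_entity(mention, known)
fixed = conn.execute(
    "SELECT founded FROM companies WHERE name = ?", (canonical,)).fetchall()
```

String similarity is a blunt tool: it can also map a mention to the wrong company when names are near-duplicates, so the cutoff would need tuning against real data.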
Core Entities
Models
- Baichuan2-13B
- GLM4-9B
- Qwen1.5-14B
- GPT-3.5
- GPT-4
- Wenxin
- Kimi
- ChatPDF
- Gemini 1.5 Pro
Metrics
- ACC
- PRO
- SIM
- Precision
- F1
- ROUGE-1
- ROUGE-L
- Completeness
- Clarity
- AVG
Datasets
- InsQABench
- Insurance Commonsense QA
- Insurance Database QA
- Insurance Clause QA
- InsuranceQA_zh (used as commonsense test source)
Benchmarks
- InsQABench
Context Entities
Models
- GPT-4o (used as judge in model-based evaluation)
Datasets
- 25k insurance products crawled for the database
- 8k crawled user QAs and 2k expert-written commonsense QAs (10k training examples in total)
- 546 manual Database QA test examples
- 870 Clause QA test examples (100 high-quality evaluated)

