First benchmark and toolkit to test RAG for multi-turn Chinese legal consultations

February 28, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and toolkit are solid for research and evaluation, but retrieval and generation baselines show substantial gaps before safe production deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, Yiqun Liu

Why It Matters For Business

Legal RAG tools need both higher-quality retrievers and stronger legal reasoning; current off-the-shelf systems miss many citations and can produce unreliable legal detail.

Who Should Care

Summary TLDR

LexRAG is a new dataset and toolkit for testing retrieval-augmented generation (RAG) in multi-turn legal consultations in Chinese. It contains 1,013 five-turn conversations, 5,065 queries, and 17,228 legal articles. The authors provide LexiT (a modular RAG toolkit) and an LLM-as-a-judge pipeline. Benchmarks show that dense retrievers plus query rewriting help, but the best retrieval Recall@10 reaches only 33.33%. Generation improves when ground-truth articles are provided, yet models still fall short of expert-level scores (reference best LLM judge ≈7.37 / 10). Code and data are on GitHub.

Problem Statement

There is no standard benchmark to evaluate how well RAG systems retrieve legal texts and generate legally sound answers across multi-turn, evolving consultations. That gap prevents consistent comparison and targeted improvement of legal RAG systems.

Main Contribution

LexRAG dataset: 1,013 multi-turn (5-turn) legal consultation dialogues, 17,228 statute articles, expert annotations.

LexiT toolkit: modular RAG pipeline, processors, retrievers, generators, and LLM-as-a-judge evaluation.

Key Findings

Best retrieval Recall@10 is low (hard task).

Numbers: Recall@10 = 33.33% (GTE-Qwen2-1.5B, Query Rewrite)

Practical Use: Expect many missed citations in legal RAG; prioritize better retrievers or tighter query rewriting before deployment.

Evidence Ref: Table 3
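
For readers who want to check a Recall@10 figure on their own retrieval runs, the metric can be sketched as follows. This is a minimal illustration, assuming binary relevance and a plain average over queries; the function names are hypothetical and the paper's exact evaluation script may differ.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant articles that appear in the top-k results."""
    if not relevant_ids:
        # Assumption: queries without gold articles contribute 0.0.
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a Recall@10 of 33.33% means that on average only about one in three gold articles shows up in the top ten results.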

Dense retrievers beat lexical matching on multi-turn queries.

Numbers: GTE Query Rewrite Recall@10 33.33% vs BM25 Query Rewrite 18.84%

Practical Use: Use dense retrieval (and query rewrite) for pronoun-heavy legal chats rather than vanilla BM25.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Retrieval Recall@10 | 33.33% (GTE-Qwen2-1.5B, Query Rewrite) | BM25 Query Rewrite 18.84% | +14.49 pp | LexRAG (conversational knowledge retrieval) | Table 3: GTE-Qwen2-1.5B Query Rewrite Recall@10 33.33% | Table 3
Retrieval Recall@1 | 11.46% (GTE-Qwen2-1.5B, Query Rewrite) | BM25 Query Rewrite 5.73% | +5.73 pp | LexRAG | Table 3: Recall@1 values | Table 3
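
The deltas in the table are absolute differences in percentage points (pp), not relative gains. The short sketch below recomputes both from the reported values; the helper names are illustrative only.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(value_pct - baseline_pct, 2)


def relative_gain(value_pct: float, baseline_pct: float) -> float:
    """Relative improvement over the baseline, as a fraction (1.0 means doubled)."""
    return (value_pct - baseline_pct) / baseline_pct


# Values copied from the Results table (GTE-Qwen2-1.5B vs BM25, both with Query Rewrite).
recall_at_10_delta = delta_pp(33.33, 18.84)
recall_at_1_delta = delta_pp(11.46, 5.73)
recall_at_10_gain = relative_gain(33.33, 18.84)
recall_at_1_gain = relative_gain(11.46, 5.73)
```

Read this way, the dense retriever's Recall@10 is about 77% higher than BM25's in relative terms, and its Recall@1 is double the BM25 figure.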

What To Try In 7 Days

Clone LexiT and run the provided retrieval baselines on LexRAG to reproduce numbers.

Compare 'Last Query' vs 'Query Rewrite' processing for your retriever and log Recall@10.

Test Qwen-2.5-72B with and without ground-truth articles to measure practical uplift in your stack.
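
The 'Last Query' vs 'Query Rewrite' comparison can be prototyped before touching LexiT. The sketch below is not LexiT's API: the processor functions, the keyword-overlap retriever, and the English toy corpus are all hypothetical stand-ins, and `naive_rewrite` simply concatenates earlier turns where a real setup would call an LLM rewriter.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Conversation = List[str]                      # user turns so far, oldest first
QueryProcessor = Callable[[Conversation], str]
Retriever = Callable[[str, int], List[str]]   # (query, k) -> ranked article ids


def last_query(turns: Conversation) -> str:
    """'Last Query' strategy: retrieve with the final user turn only."""
    return turns[-1]


def naive_rewrite(turns: Conversation) -> str:
    """Hypothetical stand-in for an LLM rewriter: join all turns as context."""
    return " ".join(turns)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0


def compare_processors(
    examples: Sequence[Tuple[Conversation, Set[str]]],
    retriever: Retriever,
    processors: Dict[str, QueryProcessor],
    k: int = 10,
) -> Dict[str, float]:
    """Mean Recall@k per query-processing strategy, using the same retriever."""
    results: Dict[str, float] = {}
    for name, process in processors.items():
        scores = [recall_at_k(retriever(process(turns), k), relevant, k)
                  for turns, relevant in examples]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


# Toy data for illustration only; not taken from LexRAG.
TOY_CORPUS = {
    "art_1": "lease deposit refund landlord tenant",
    "art_2": "labor contract termination compensation",
}


def toy_retriever(query: str, k: int) -> List[str]:
    """Rank articles by keyword overlap with the query; drop zero-overlap hits."""
    tokens = set(query.lower().split())
    scored = [(len(tokens & set(text.split())), doc_id)
              for doc_id, text in TOY_CORPUS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked][:k]


EXAMPLES = [(["my landlord kept my deposit", "can i get it back"], {"art_1"})]
RESULTS = compare_processors(
    EXAMPLES, toy_retriever,
    {"last_query": last_query, "naive_rewrite": naive_rewrite},
)
```

In this toy case the final turn ("can i get it back") carries no retrievable keywords, so only the context-aware query finds the article. Swapping `toy_retriever` for a real retriever and `naive_rewrite` for an actual rewriter keeps the logging harness unchanged.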

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Dataset covers Chinese statutory law only; not multilingual or cross-jurisdictional.

Multi-turn dialogues were authored and expanded by legal experts, so they may miss real-user interaction diversity.

When Not To Use

Do not use LexRAG as the sole safety check for production legal advice without human lawyers.

Not suitable for evaluating non-Chinese legal systems or multilingual pipelines.

Failure Modes

Retriever misses relevant statutes due to pronouns and implicit legal intent.

Noisy or partial retrieved articles can mislead the generator and reduce judged quality.

Core Entities

Models

GLM-4, GLM-4-flash, GPT-3.5-turbo, GPT-4o-mini, Qwen-2.5-72B-Instruct, LLaMA-3.3-70B-Instruct, Claude-3.5-sonnet, BGE-base-zh, GTE-Qwen2-1.5B-instruct, text-embedding-3

Metrics

Recall@k, nDCG@k, Accuracy, LLM-judge score, ROUGE, BLEU, BERTScore
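
Of the retrieval metrics listed, nDCG@k is the one that rewards placing relevant articles near the top of the ranking. A minimal sketch follows, assuming binary relevance, unique article ids in the ranking, and the standard log2 discount; the benchmark's own implementation may use graded relevance or a different convention.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: the gain at rank i (1-based) is divided by log2(i + 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary relevance)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

For example, a single relevant article ranked first scores 1.0, while the same article ranked second scores 1 / log2(3), roughly 0.63.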

Datasets

LexRAG

Benchmarks

LexRAG

Context Entities

Datasets

222 Chinese statutes (processed into 17,228 provisions)

Legal Books (26,951 provisions)

Legal Cases (2,370 guiding cases)