First benchmark and toolkit to test RAG for multi-turn Chinese legal consultations

Overview

Decision Snapshot: Needs Validation

The dataset and toolkit are solid for research and evaluation, but the retrieval and generation baselines show substantial gaps that must be closed before safe production deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, Yiqun Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Legal RAG tools need both higher-quality retrievers and stronger legal reasoning; current off-the-shelf systems miss many citations and can produce unreliable legal detail.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Founder

Summary TLDR

LexRAG is a new dataset and toolkit for testing retrieval-augmented generation (RAG) in multi-turn legal consultations in Chinese. It contains 1,013 five-turn conversations, 5,065 queries, and 17,228 legal articles. The authors provide LexiT (a modular RAG toolkit) and an LLM-as-a-judge pipeline. Benchmarks show that dense retrievers combined with query rewriting help, but the best retrieval Recall@10 reaches only 33.33%. Generation improves when ground-truth articles are provided, yet models still fall short of expert-level scores (reference best LLM-judge score ≈ 7.37/10). Code and data are on GitHub.

Problem Statement

There is no standard benchmark to evaluate how well RAG systems retrieve legal texts and generate legally sound answers across multi-turn, evolving consultations. That gap prevents consistent comparison and targeted improvement of legal RAG systems.

Main Contribution

LexRAG dataset: 1,013 multi-turn (5-turn) legal consultation dialogues, 17,228 statute articles, expert annotations.

LexiT toolkit: modular RAG pipeline, processors, retrievers, generators, and LLM-as-a-judge evaluation.

Key Findings

Even the best retriever reaches a low Recall@10, which indicates a hard task.

Numbers: Recall@10 = 33.33% (GTE-Qwen2-1.5B, Query Rewrite)

Practical Use: Expect many missed citations in legal RAG; prioritize better retrievers or tighter query rewriting before deployment.

Evidence Ref: Table 3

Dense retrievers beat lexical matching on multi-turn queries.

Numbers: GTE Query Rewrite Recall@10 33.33% vs BM25 Query Rewrite 18.84%

Practical Use: Use dense retrieval (and query rewrite) for pronoun-heavy legal chats rather than vanilla BM25.

Evidence Ref: Table 3
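
The Recall@10 figures in the findings above can be read against the common definition of Recall@k: the share of a query's relevant articles that appear in the top k retrieved results, averaged over queries. The sketch below assumes that definition (the paper's exact averaging may differ) and uses made-up article IDs purely for illustration.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of relevant articles found in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0  # assumption: queries without gold articles score 0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average Recall@k over (ranked_ids, relevant_ids) pairs."""
    scores = [recall_at_k(ranked, gold, k) for ranked, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy runs with hypothetical article IDs (not taken from LexRAG).
runs = [
    (["a1", "a7", "a3"], {"a1", "a9"}),  # 1 of 2 relevant articles retrieved
    (["b2", "b4"], {"b4"}),              # 1 of 1 relevant article retrieved
]
print(mean_recall_at_k(runs, k=10))  # 0.75
```

Under this definition, a Recall@10 of 33.33% means roughly two thirds of the relevant articles never reach the top 10, which is why missed citations are the main practical concern.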

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Retrieval Recall@10	33.33% (GTE-Qwen2-1.5B, Query Rewrite)	BM25 Query Rewrite 18.84%	+14.49 pp	LexRAG (conversational knowledge retrieval)	Table 3: GTE-Qwen2-1.5B Query Rewrite Recall@10 33.33%	Table 3
Retrieval Recall@1	11.46% (GTE-Qwen2-1.5B, Query Rewrite)	BM25 Query Rewrite 5.73%	+5.73 pp	LexRAG	Table 3: Recall@1 values	Table 3
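
The Delta column above is an absolute difference in percentage points (pp), not a relative change. A small sketch of the arithmetic, using only the values from the table; the helper names are illustrative and the relative-gain figure is derived here, not reported in the source.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points."""
    return round(value_pct - baseline_pct, 2)


def relative_gain_pct(value_pct: float, baseline_pct: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round((value_pct - baseline_pct) / baseline_pct * 100, 1)


# Recall@10: GTE-Qwen2-1.5B (Query Rewrite) vs BM25 (Query Rewrite)
print(delta_pp(33.33, 18.84))           # 14.49
print(relative_gain_pct(33.33, 18.84))  # 76.9

# Recall@1: same pair of systems
print(delta_pp(11.46, 5.73))            # 5.73
print(relative_gain_pct(11.46, 5.73))   # 100.0
```

Keeping the two readings apart matters when comparing retrievers: the dense model roughly doubles Recall@1 in relative terms, yet the absolute value stays near 11%.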

What To Try In 7 Days

Clone LexiT and run the provided retrieval baselines on LexRAG to reproduce numbers.

Compare 'Last Query' vs 'Query Rewrite' processing for your retriever and log Recall@10.

Test Qwen-2.5-72B with and without ground-truth articles to measure practical uplift in your stack.
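
The 'Last Query' vs 'Query Rewrite' comparison suggested above can be set up as a small harness that swaps the query-building strategy while holding the retriever fixed. This is a minimal sketch, not LexiT's API: every name here is hypothetical, the toy retriever is a keyword-overlap stand-in, and `naive_rewrite` simply concatenates turns where a real pipeline would call an LLM rewriter.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Strategy = Callable[[Sequence[str]], str]   # dialogue turns -> retrieval query
Retriever = Callable[[str], List[str]]      # query -> ranked article IDs


def last_query(turns: Sequence[str]) -> str:
    """Use only the final user turn as the retrieval query."""
    return turns[-1]


def naive_rewrite(turns: Sequence[str]) -> str:
    """Stand-in for query rewriting: joins all turns (an LLM would be used in practice)."""
    return " ".join(turns)


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0


def compare(dialogues: List[Tuple[Sequence[str], Set[str]]],
            retriever: Retriever,
            strategies: Dict[str, Strategy],
            k: int = 10) -> Dict[str, float]:
    """Mean Recall@k per query-building strategy."""
    results = {}
    for name, build_query in strategies.items():
        scores = [recall_at_k(retriever(build_query(turns)), gold, k)
                  for turns, gold in dialogues]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


# Toy corpus and retriever (hypothetical; English text for readability).
CORPUS = {
    "art_lease": "lease deposit landlord refund",
    "art_labor": "employment contract termination compensation",
}


def toy_retriever(query: str) -> List[str]:
    q = set(query.lower().split())
    overlap = {aid: len(q & set(text.split())) for aid, text in CORPUS.items()}
    hits = [aid for aid, n in overlap.items() if n > 0]
    return sorted(hits, key=lambda aid: overlap[aid], reverse=True)


dialogues = [
    (["my landlord kept the deposit", "can I get it back"], {"art_lease"}),
]
print(compare(dialogues, toy_retriever,
              {"last_query": last_query, "query_rewrite": naive_rewrite}))
# {'last_query': 0.0, 'query_rewrite': 1.0}
```

The toy dialogue shows why the last turn alone can fail: "can I get it back" carries no legal terms, so the retriever needs context from earlier turns to find the relevant article.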

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/CSHaitao/LexRAG

Data URLs

https://github.com/CSHaitao/LexRAG

Risks & Boundaries

Limitations

Dataset covers Chinese statutory law only; not multilingual or cross-jurisdictional.

Multi-turn dialogues were authored and expanded by legal experts, so they may miss real-user interaction diversity.

When Not To Use

Do not use LexRAG as the sole safety check for production legal advice without human lawyers.

Not suitable for evaluating non-Chinese legal systems or multilingual pipelines.

Failure Modes

Retriever misses relevant statutes due to pronouns and implicit legal intent.

Noisy or partial retrieved articles can mislead the generator and reduce judged quality.

Core Entities

Models

GLM-4, GLM-4-flash, GPT-3.5-turbo, GPT-4o-mini, Qwen-2.5-72B-Instruct, LLaMA-3.3-70B-Instruct, Claude-3.5-sonnet, BGE-base-zh, GTE-Qwen2-1.5B-instruct, text-embedding-3

Metrics

Recall@k, nDCG@k, Accuracy, LLM-judge score, ROUGE, BLEU, BERTScore

Datasets

LexRAG

Benchmarks

LexRAG

Context Entities

Datasets

222 Chinese statutes (processed into 17,228 provisions), Legal Books (26,951 provisions), Legal Cases (2,370 guiding cases)
