First benchmark and toolkit to test RAG for multi-turn Chinese legal consultations

February 28, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, Yiqun Liu

Links

Abstract / PDF

Why It Matters For Business

Legal RAG tools need both higher-quality retrievers and stronger legal reasoning; current off-the-shelf systems miss many citations and can produce unreliable legal detail.

Summary TLDR

LexRAG is a new dataset and toolkit for testing retrieval-augmented generation (RAG) in multi-turn legal consultations in Chinese. It contains 1,013 five-turn conversations, 5,065 queries, and 17,228 legal articles. The authors provide LexiT (a modular RAG toolkit) and an LLM-as-a-judge pipeline. Benchmarks show that dense retrievers plus query rewriting help, but the best retrieval Recall@10 reaches only 33.33%. Generation improves when ground-truth articles are provided, yet models still fall short of expert-level quality: the best LLM-judge score in the reference setting is about 7.37 / 10. Code and data are on GitHub.

Problem Statement

There is no standard benchmark to evaluate how well RAG systems retrieve legal texts and generate legally sound answers across multi-turn, evolving consultations. That gap prevents consistent comparison and targeted improvement of legal RAG systems.

Main Contribution

LexRAG dataset: 1,013 multi-turn (5-turn) legal consultation dialogues, 17,228 statute articles, expert annotations.

LexiT toolkit: a modular RAG pipeline with processors, retrievers, generators, and LLM-as-a-judge evaluation.

Systematic benchmark: retrieval and generation baselines showing current limits and failure modes in legal multi-turn RAG.

Key Findings

Even the best retrieval setup has low Recall@10, which shows how hard the task is.

Numbers: Recall@10 = 33.33% (GTE-Qwen2-1.5B, Query Rewrite)

Dense retrievers beat lexical matching on multi-turn queries.

Numbers: GTE Query Rewrite Recall@10 33.33% vs BM25 Query Rewrite 18.84%

Providing ground-truth legal articles raises keyword coverage and judge scores.

Numbers: Qwen-2.5 reference accuracy 53.24% vs zero-shot 40.83%; judge 7.37 vs 7.24

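Retrieved articles can be noisy: with the retriever in the loop, Qwen-2.5 scores below both the reference and the zero-shot settings.

Numbers: Qwen-2.5 judge score 7.09 with retrieved articles vs 7.37 with reference articles (drop of 0.28); zero-shot is 7.24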

The dataset is expert-annotated and includes expert-vetted references.

Numbers: 1,013 conversations; 5,065 queries; 17,228 legal articles

Results

Retrieval Recall@10

Value: 33.33% (GTE-Qwen2-1.5B, Query Rewrite)

Baseline: BM25 Query Rewrite 18.84%

Retrieval Recall@1

Value: 11.46% (GTE-Qwen2-1.5B, Query Rewrite)

Baseline: BM25 Query Rewrite 5.73%
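For readers who want to check Recall@k figures on their own runs, the following is a minimal sketch of the standard metric. It assumes each query has a set of gold article IDs and that scores are averaged per query; the function names and the toy IDs are illustrative and are not taken from LexiT, whose exact averaging may differ.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the gold articles that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy data with made-up article IDs (not from LexRAG).
toy_runs = [
    (["a1", "a7", "a3"], {"a3", "a9"}),  # one of two gold articles retrieved
    (["b2", "b4", "b6"], {"b2"}),        # the only gold article is ranked first
]
print(mean_recall_at_k(toy_runs, k=3))
print(mean_recall_at_k(toy_runs, k=1))
```

Running the same function at k=1 and k=10 on your own retriever output gives numbers that are directly comparable in form to the two retrieval rows above.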

Accuracy

Value: Qwen-2.5 Reference 53.24% vs Zero-shot 40.83%

Baseline: Zero-shot Qwen-2.5 40.83%
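This summary does not spell out how Accuracy is computed, but the Key Findings tie it to keyword coverage. Under that assumption, a rough sketch is the share of expert-annotated keywords that appear in the generated answer. The function name, the substring matching rule, and the example strings below are all hypothetical; the benchmark's own scoring may normalize or match text differently.

```python
from typing import Sequence


def keyword_accuracy(answer: str, keywords: Sequence[str]) -> float:
    """Share of reference keywords found verbatim in the answer (assumed definition)."""
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw in answer)
    return hits / len(keywords)


# Made-up example: two of the three keywords appear in the answer.
answer = "The tenant may claim the deposit under the contract."
print(keyword_accuracy(answer, ["deposit", "contract", "interest"]))
```

Exact substring matching is the simplest choice; for Chinese text it avoids tokenization, but it will miss paraphrases.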

LLM-judge score (ALL)

Value: Qwen-2.5 Reference 7.37 / 10; Retriever 7.09; Zero-shot 7.24

Baseline: Zero-shot 7.24

Who Should Care

What To Try In 7 Days

Clone LexiT and run the provided retrieval baselines on LexRAG to reproduce numbers.

Compare 'Last Query' vs 'Query Rewrite' processing for your retriever and log Recall@10.

Test Qwen-2.5-72B with and without ground-truth articles to measure practical uplift in your stack.
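The 'Last Query' vs 'Query Rewrite' comparison can be sketched as a small harness that swaps the query-building step and logs Recall@k per strategy. Everything here is an assumption for illustration: the `Retriever` and `QueryBuilder` signatures, the toy keyword index, and the toy rewriter (which simply joins the turns, standing in for whatever rewriting step you actually use) are not LexiT's API.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Retriever = Callable[[str, int], List[str]]    # (query, k) -> ranked article IDs
QueryBuilder = Callable[[Sequence[str]], str]  # user turns so far -> retrieval query


def last_query(turns: Sequence[str]) -> str:
    """'Last Query': use only the most recent user turn."""
    return turns[-1]


def compare_strategies(
    dialogues: List[Tuple[Sequence[str], Set[str]]],
    retriever: Retriever,
    strategies: Dict[str, QueryBuilder],
    k: int = 10,
) -> Dict[str, float]:
    """Mean Recall@k per query-building strategy over (turns, gold_ids) pairs."""
    results: Dict[str, float] = {}
    for name, build_query in strategies.items():
        scores = []
        for turns, gold in dialogues:
            ranked = retriever(build_query(turns), k)
            hits = len(set(ranked[:k]) & gold)
            scores.append(hits / len(gold) if gold else 0.0)
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


# --- Toy stand-ins (hypothetical; replace with your retriever and rewriter) ---
def toy_rewriter(turns: Sequence[str]) -> str:
    return " ".join(turns)  # placeholder for a real query-rewriting step


def toy_retriever(query: str, k: int) -> List[str]:
    index = {"deposit": "art_deposit", "lease": "art_lease"}
    return [article for keyword, article in index.items() if keyword in query][:k]


toy_dialogues = [
    (["I paid a deposit on a lease.", "Can I get it back?"], {"art_deposit"}),
]
results = compare_strategies(
    toy_dialogues,
    toy_retriever,
    {"last_query": last_query, "query_rewrite": toy_rewriter},
)
print(results)
```

In the toy dialogue the final turn only says "it", so the last-query strategy has nothing to match on. That mirrors the pronoun problem listed under Failure Modes, though the toy numbers say nothing about real retrievers.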

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dataset covers Chinese statutory law only; not multilingual or cross-jurisdictional.
  • Multi-turn dialogues were authored and expanded by legal experts, so they may miss real-user interaction diversity.
  • Retriever and judge evaluations depend on current LLMs and may reflect judge-model biases.

When Not To Use

  • Do not use LexRAG as the sole safety check for production legal advice without human lawyers.
  • Not suitable for evaluating non-Chinese legal systems or multilingual pipelines.

Failure Modes

  • Retriever misses relevant statutes when later turns rely on pronouns or leave the legal intent implicit.
  • Noisy or partial retrieved articles can mislead the generator and reduce judged quality.
  • LLM-as-a-judge may inherit biases or mis-evaluate nuanced legal reasoning.

Core Entities

Models

  • GLM-4
  • GLM-4-flash
  • GPT-3.5-turbo
  • GPT-4o-mini
  • Qwen-2.5-72B-Instruct
  • LLaMA-3.3-70B-Instruct
  • Claude-3.5-sonnet
  • BGE-base-zh
  • GTE-Qwen2-1.5B-instruct
  • text-embedding-3

Metrics

  • Recall@k
  • nDCG@k
  • Accuracy
  • LLM-judge score
  • ROUGE
  • BLEU
  • BERTScore
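Of the metrics listed, nDCG@k is the one most often implemented inconsistently, so a minimal sketch may help. It assumes binary relevance (an article is either a gold citation or not) and the common log2 position discount; whether the benchmark uses graded relevance or a different discount is not stated in this summary.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: the gain at 0-based rank i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """nDCG@k with binary relevance (assumed), normalized by the ideal ranking."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Made-up IDs: the single gold article is ranked second rather than first.
print(ndcg_at_k(["x", "a"], {"a"}, k=2))
```

Unlike Recall@k, this score drops when a gold article is retrieved but ranked lower, which is why the two metrics are usually reported together.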

Datasets

  • LexRAG

Benchmarks

  • LexRAG

Context Entities

Datasets

  • 222 Chinese statutes (processed into 17,228 provisions)
  • Legal Books (26,951 provisions)
  • Legal Cases (2,370 guiding cases)