Overview
The dataset and toolkit are solid for research and evaluation, but retrieval and generation baselines show substantial gaps before safe production deployment.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 70%
Why It Matters For Business
Legal RAG tools need both higher-quality retrievers and stronger legal reasoning; current off-the-shelf systems miss many citations and can produce unreliable legal detail.
Summary TLDR
LexRAG is a new dataset and toolkit for testing retrieval-augmented generation (RAG) in multi-turn legal consultations in Chinese. It contains 1,013 five-turn conversations, 5,065 queries, and 17,228 legal articles. The authors provide LexiT (a modular RAG toolkit) and an LLM-as-a-judge pipeline. Benchmarks show that dense retrievers and query rewriting both help, but the best retrieval Recall@10 reaches only 33.33%. Generation improves when ground-truth articles are provided, yet models still fall short of expert-level scores (best reference LLM-judge score ≈ 7.37 / 10). Code and data are available on GitHub.
Problem Statement
There is no standard benchmark to evaluate how well RAG systems retrieve legal texts and generate legally sound answers across multi-turn, evolving consultations. That gap prevents consistent comparison and targeted improvement of legal RAG systems.
Main Contribution
LexRAG dataset: 1,013 multi-turn (5-turn) legal consultation dialogues, 17,228 statute articles, expert annotations.
LexiT toolkit: modular RAG pipeline, processors, retrievers, generators, and LLM-as-a-judge evaluation.
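As a rough illustration of what "modular" can mean here, the sketch below wires a processor, a retriever, and a generator into one pipeline. It is not LexiT's actual API: `RagPipeline`, the three callable types, and the stub components are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical component signatures (assumptions, not LexiT's interfaces).
Processor = Callable[[List[str]], str]        # dialogue turns -> search query
Retriever = Callable[[str, int], List[str]]   # query, k -> retrieved article texts
Generator = Callable[[str, List[str]], str]   # query, articles -> answer


@dataclass
class RagPipeline:
    """Composes three swappable stages into one multi-turn RAG call."""
    processor: Processor
    retriever: Retriever
    generator: Generator
    top_k: int = 10

    def answer(self, turns: List[str]) -> str:
        query = self.processor(turns)                   # e.g. last query or a rewrite
        articles = self.retriever(query, self.top_k)    # candidate statute articles
        return self.generator(query, articles)          # grounded response


# Stub components so the sketch runs without any model or index.
def use_last_turn(turns: List[str]) -> str:
    return turns[-1]


def stub_retriever(query: str, k: int) -> List[str]:
    return ["Article A (placeholder text)"][:k]


def stub_generator(query: str, articles: List[str]) -> str:
    return f"Q: {query} | cited: {len(articles)} article(s)"


pipeline = RagPipeline(use_last_turn, stub_retriever, stub_generator)
print(pipeline.answer(["first question", "follow-up question"]))
```

Because each stage is a plain callable, swapping BM25 for a dense retriever, or a last-query processor for a rewriting one, changes one argument and leaves the rest of the pipeline untouched.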
Key Findings
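Best retrieval Recall@10 is only 33.33% (GTE-Qwen2-1.5B with Query Rewrite), which indicates a hard task.
Dense retrievers beat lexical matching on multi-turn queries: with Query Rewrite, GTE-Qwen2-1.5B reaches 33.33% Recall@10 versus 18.84% for BM25.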
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Retrieval Recall@10 | 33.33% (GTE-Qwen2-1.5B, Query Rewrite) | BM25 Query Rewrite 18.84% | +14.49 pp | LexRAG (conversational knowledge retrieval) | Table 3: GTE-Qwen2-1.5B Query Rewrite Recall@10 33.33% | Table 3 |
| Retrieval Recall@1 | 11.46% (GTE-Qwen2-1.5B, Query Rewrite) | BM25 Query Rewrite 5.73% | +5.73 pp | LexRAG | Table 3: Recall@1 values | Table 3 |
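For readers who want to check the table's arithmetic, here is a minimal sketch of Recall@k and of the percentage-point delta. It assumes the common definition of Recall@k (the share of relevant articles found in the top k results, averaged over queries); the paper's exact formulation may differ, and the function names are illustrative.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of relevant articles that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0  # convention chosen here for queries with no gold articles
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def delta_pp(system_pct: float, baseline_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(system_pct - baseline_pct, 2)


# Deltas recomputed from the values reported in the table above.
print(delta_pp(33.33, 18.84))  # Recall@10: dense vs. BM25, both with Query Rewrite
print(delta_pp(11.46, 5.73))   # Recall@1
```

The deltas are absolute differences in percentage points, not relative improvements: in relative terms, the dense retriever's Recall@1 is twice the BM25 value.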
What To Try In 7 Days
Clone LexiT and run the provided retrieval baselines on LexRAG to reproduce numbers.
Compare 'Last Query' vs 'Query Rewrite' processing for your retriever and log Recall@10.
Test Qwen-2.5-72B with and without ground-truth articles to measure practical uplift in your stack.
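The 'Last Query' versus 'Query Rewrite' comparison can be prototyped before touching the real toolkit. The sketch below is a toy harness under loud assumptions: `naive_rewrite` simply concatenates earlier turns as a stand-in for an LLM-based rewriter, `toy_retriever` ranks by word overlap over a two-article English corpus, and none of these names come from LexiT.

```python
from typing import Callable, Dict, List, Set, Tuple

Dialogue = Tuple[List[str], Set[str]]  # (user turns, gold article ids)

# Hypothetical two-article corpus, in English for readability only.
CORPUS: Dict[str, str] = {
    "art1": "lease deposit refund",
    "art2": "employment contract termination",
}


def last_query(turns: List[str]) -> str:
    """Use only the final user turn as the search query."""
    return turns[-1]


def naive_rewrite(turns: List[str]) -> str:
    """Placeholder rewriter: prepend earlier turns (a real one would use an LLM)."""
    return " ".join(turns)


def toy_retriever(query: str, k: int) -> List[str]:
    """Rank articles by word overlap with the query; drop zero-overlap articles."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), art_id) for art_id, text in CORPUS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [art_id for _, art_id in ranked[:k]]


def evaluate(strategy: Callable[[List[str]], str],
             retriever: Callable[[str, int], List[str]],
             dialogues: List[Dialogue], k: int = 10) -> float:
    """Mean Recall@k of a query-processing strategy over the given dialogues."""
    scores = []
    for turns, relevant in dialogues:
        hits = set(retriever(strategy(turns), k)) & relevant
        scores.append(len(hits) / len(relevant) if relevant else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


dialogues = [(["my landlord kept my lease deposit", "can I get it back"], {"art1"})]
for name, strategy in [("Last Query", last_query), ("Query Rewrite", naive_rewrite)]:
    print(name, evaluate(strategy, toy_retriever, dialogues))
```

In this toy case the final turn contains only a pronoun ("it"), so the last-query strategy retrieves nothing while the rewritten query recovers the gold article. Replacing `toy_retriever` with your own retriever and `dialogues` with LexRAG conversations gives the logged Recall@10 comparison suggested above.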
Risks & Boundaries
Limitations
Dataset covers Chinese statutory law only; not multilingual or cross-jurisdictional.
Multi-turn dialogues were authored and expanded by legal experts, so they may miss real-user interaction diversity.
When Not To Use
Do not use LexRAG as the sole safety check for production legal advice; outputs still need review by human lawyers.
Not suitable for evaluating non-Chinese legal systems or multilingual pipelines.
Failure Modes
Retriever misses relevant statutes due to pronouns and implicit legal intent.
Noisy or partial retrieved articles can mislead the generator and reduce judged quality.

