Overview
Production Readiness
0.4
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Legal RAG tools need both higher-quality retrievers and stronger legal reasoning; current off-the-shelf systems miss many citations and can produce unreliable legal detail.
Summary TLDR
LexRAG is a new dataset and toolkit for testing retrieval-augmented generation (RAG) in multi-turn legal consultations in Chinese. It contains 1,013 five-turn conversations (5,065 queries) and a corpus of 17,228 legal articles. The authors also provide LexiT, a modular RAG toolkit, and an LLM-as-a-judge evaluation pipeline. Benchmarks show that dense retrievers and query rewriting help, but the best retrieval Recall@10 reaches only 33.33%. Generation improves when ground-truth articles are provided, yet models still fall short of expert-level scores (best LLM-judge score with reference articles ≈ 7.37/10). Code and data are on GitHub.
Problem Statement
There is no standard benchmark to evaluate how well RAG systems retrieve legal texts and generate legally sound answers across multi-turn, evolving consultations. That gap prevents consistent comparison and targeted improvement of legal RAG systems.
Main Contribution
LexRAG dataset: 1,013 multi-turn (5-turn) legal consultation dialogues, 17,228 statute articles, expert annotations.
LexiT toolkit: a modular RAG pipeline with processors, retrievers, generators, and LLM-as-a-judge evaluation.
Systematic benchmark: retrieval and generation baselines showing current limits and failure modes in legal multi-turn RAG.
Key Findings
Even the best retrieval setup reaches only 33.33% Recall@10, which shows how hard the task is.
Dense retrievers beat lexical matching on multi-turn queries.
Providing ground-truth legal articles raises keyword coverage and judge scores.
Retrieved articles can be noisy and do not always help generation.
The dataset comes with expert annotations and expert-vetted references.
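The keyword-coverage finding above can be made concrete with a small sketch. The definition used here (the fraction of expert-annotated keywords that appear verbatim in the generated answer) is an assumption for illustration and may differ from how the LexRAG authors compute it; `keyword_coverage` is a hypothetical helper, not part of LexiT.

```python
from typing import Sequence


def keyword_coverage(answer: str, keywords: Sequence[str]) -> float:
    """Fraction of keywords found verbatim in the answer (assumed definition).

    Plain substring matching is used, so no word segmentation is needed
    for Chinese text.
    """
    if not keywords:
        return 0.0
    hits = sum(1 for keyword in keywords if keyword in answer)
    return hits / len(keywords)


# Toy example: the answer mentions "deposit" (押金) and "return" (返还)
# but not "liquidated damages" (违约金).
answer = "租客可以要求返还押金"
keywords = ["押金", "返还", "违约金"]
coverage = keyword_coverage(answer, keywords)
```

Running the same function on answers generated with and without the ground-truth articles gives a quick way to check whether the reported uplift also appears in a local setup.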
Results
Retrieval Recall@10
Retrieval Recall@1
Accuracy
LLM-judge score (ALL)
Who Should Care
What To Try In 7 Days
Clone LexiT and run the provided retrieval baselines on LexRAG to reproduce numbers.
Compare 'Last Query' vs 'Query Rewrite' processing for your retriever and log Recall@10.
Test Qwen-2.5-72B with and without ground-truth articles to measure practical uplift in your stack.
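A minimal sketch of the 'Last Query' versus 'Query Rewrite' comparison follows. Everything in it is a stand-in: `toy_retrieve` is a word-overlap ranker over a three-document corpus, and `concat_rewrite` simply joins all turns where a real setup would call an LLM rewriter. None of these names come from LexiT; only the Recall@k formula is standard.

```python
from typing import Callable, Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant articles that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def last_query(history: List[str]) -> str:
    """Use only the final user turn as the retrieval query."""
    return history[-1]


def concat_rewrite(history: List[str]) -> str:
    """Hypothetical stand-in for an LLM rewrite: join all turns into one query."""
    return " ".join(history)


# Toy corpus (hypothetical); a real run would index the LexRAG legal articles.
CORPUS: Dict[str, str] = {
    "art_1": "employment contract termination notice",
    "art_2": "traffic accident liability",
    "art_3": "lease deposit refund tenant",
}


def toy_retrieve(query: str) -> List[str]:
    """Rank articles by word overlap with the query; ties break by article id."""
    query_words = set(query.lower().split())
    return sorted(
        CORPUS,
        key=lambda art: (-len(query_words & set(CORPUS[art].split())), art),
    )


def mean_recall(
    samples: List[dict],
    process: Callable[[List[str]], str],
    retrieve: Callable[[str], List[str]],
    k: int = 10,
) -> float:
    """Average Recall@k over samples for one query-processing strategy."""
    scores = [
        recall_at_k(retrieve(process(s["history"])), set(s["relevant"]), k)
        for s in samples
    ]
    return sum(scores) / len(scores) if scores else 0.0


SAMPLES = [
    {
        "history": ["my landlord kept my lease deposit", "can i get it back"],
        "relevant": ["art_3"],
    }
]

# k=1 only because the toy corpus has three documents; use k=10 on LexRAG.
recall_last = mean_recall(SAMPLES, last_query, toy_retrieve, k=1)
recall_rewrite = mean_recall(SAMPLES, concat_rewrite, toy_retrieve, k=1)
```

On this toy data the final turn ("can i get it back") shares no words with any article, so `recall_last` is 0.0, while the joined query recovers the lease article and `recall_rewrite` is 1.0. Swapping in a real retriever and rewriter only requires replacing the two callables passed to `mean_recall`.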
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Dataset covers Chinese statutory law only; not multilingual or cross-jurisdictional.
- Multi-turn dialogues were authored and expanded by legal experts, so they may miss real-user interaction diversity.
- Retriever and judge evaluations depend on current LLMs and may reflect judge-model biases.
When Not To Use
- Do not use LexRAG as the sole safety check for production legal advice without human lawyers.
- Not suitable for evaluating non-Chinese legal systems or multilingual pipelines.
Failure Modes
- Retriever misses relevant statutes when queries rely on pronouns or leave the legal intent implicit.
- Noisy or partial retrieved articles can mislead the generator and reduce judged quality.
- LLM-as-a-judge may inherit biases or mis-evaluate nuanced legal reasoning.
Core Entities
Models
- GLM-4
- GLM-4-flash
- GPT-3.5-turbo
- GPT-4o-mini
- Qwen-2.5-72B-Instruct
- LLaMA-3.3-70B-Instruct
- Claude-3.5-sonnet
- BGE-base-zh
- GTE-Qwen2-1.5B-instruct
- text-embedding-3
Metrics
- Recall@k
- nDCG@k
- Accuracy
- LLM-judge score
- ROUGE
- BLEU
- BERTScore
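Of the retrieval metrics listed above, nDCG@k is the one that rewards ranking relevant articles higher, not just including them. The sketch below uses the common binary-relevance form with a log2 discount; whether LexiT uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math
from typing import Sequence, Set


def dcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Discounted cumulative gain with binary relevance (gain 1 per hit).

    Assumes ranked_ids contains no duplicate article ids.
    """
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """DCG@k divided by the DCG of an ideal ranking (all relevant items first)."""
    if not relevant_ids:
        return 0.0
    ideal_hits = min(len(relevant_ids), k)
    ideal_dcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg_at_k(ranked_ids, relevant_ids, k) / ideal_dcg


# The same relevant article scores lower when it is ranked second.
score_first = ndcg_at_k(["a", "b"], {"a"}, k=2)
score_second = ndcg_at_k(["b", "a"], {"a"}, k=2)
```

Two rankings with identical Recall@k can therefore have different nDCG@k, which is why the two are usually reported together.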
Datasets
- LexRAG
Benchmarks
- LexRAG
Context Entities
Datasets
- 222 Chinese statutes (processed into 17,228 provisions)
- Legal Books (26,951 provisions)
- Legal Cases (2,370 guiding cases)

