Multi-agent, retrieval-first system that cuts legal LLM hallucinations by iterating search, judge, and summary

August 31, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Ziqi Wang, Boqin Yuan

Links

Abstract / PDF

Why It Matters For Business

If legal accuracy and sourceable answers matter, invest in an agentic RAG pipeline with an evidence-checking judge; expect big accuracy and trust gains but higher latency and engineering integration costs.

Summary TLDR

L-MARS is a multi-agent pipeline for legal question answering that alternates query decomposition, targeted retrieval (web, local BM25, CourtListener case law), and a Judge Agent that checks evidence sufficiency before an answer is synthesized. On a new 200-question LegalSearchQA benchmark, L-MARS raises multiple-choice accuracy from ~0.86–0.89 (pure LLMs) to 0.96 (simple mode) and 0.98 (multi-turn), and lowers a rule-based uncertainty score (U-Score) from ~0.55–0.62 to ~0.39–0.42. The cost is latency: 13.6s in simple mode and 55.7s in multi-turn, versus 1.7–3.8s for the baselines. Code is publicly linked. The system is practical when accuracy and citation grounding matter more than low latency.

Problem Statement

Legal questions need up-to-date, jurisdiction-specific evidence. Single-pass LLMs hallucinate or hedge when authoritative sources fall outside their training cutoff or are missed by retrieval. The paper asks: can iterative agentic search plus an evidence-checking Judge Agent reduce hallucination and uncertainty while grounding answers in authoritative law?

Main Contribution

L-MARS: a multi-agent workflow that interleaves query decomposition, targeted retrieval, judge-based sufficiency checks, and final summarization.

Agentic search using Serper (web), a local BM25 RAG index, and CourtListener for case law; snippet-anchored content extraction to limit context.

LegalSearchQA: a 200-question benchmark focused on post-training legal facts (as of 2025) plus evaluation metrics: Accuracy, U-Score (uncertainty), and LLM-as-Judge.

Empirical results showing large accuracy and uncertainty improvements versus strong LLM baselines, and a public code repository.

Key Findings

Multi-turn L-MARS accuracy on LegalSearchQA

Numbers: Accuracy 0.98 vs. GPT-4o 0.89 on 200 questions

L-MARS reduces model uncertainty by the U-Score metric

Numbers: U-Score drops from 0.55–0.62 to 0.39–0.42

Iteration increases latency

Numbers: Response time: baselines 1.7–3.8s; L-MARS simple 13.6s; multi-turn 55.7s

Judge Agent can be over-conservative

Numbers: Human evaluation flagged unnecessary search iterations; inter-annotator agreement 0.92

Results

Accuracy

Value: L-MARS multi-turn 0.98; simple 0.96; GPT-4o 0.89; Claude 0.88; Gemini 0.86

Baseline: GPT-4o 0.89

U-Score (lower is better)

Value: L-MARS multi-turn 0.39; simple 0.42; GPT-4o 0.55; Claude 0.62; Gemini 0.58

Baseline: GPT-4o 0.55

Response Time (s)

Value: Baselines 1.69–3.84s; L-MARS simple 13.62s; multi-turn 55.67s

Baseline: GPT-4o 1.69s
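
To make the accuracy-versus-latency tradeoff concrete, the reported figures can be turned into three derived numbers: absolute accuracy gain, relative error reduction, and latency multiple over the GPT-4o baseline. This is only arithmetic on the values listed above; the function name `tradeoff` and the dictionary layout are illustrative and do not come from the paper.

```python
# Figures copied from the Results section; nothing here is newly measured.
results = {
    "GPT-4o": {"accuracy": 0.89, "latency_s": 1.69},
    "L-MARS simple": {"accuracy": 0.96, "latency_s": 13.62},
    "L-MARS multi-turn": {"accuracy": 0.98, "latency_s": 55.67},
}


def tradeoff(name, baseline="GPT-4o"):
    """Compare one system against the baseline on accuracy and latency."""
    base, system = results[baseline], results[name]
    base_error = 1 - base["accuracy"]
    system_error = 1 - system["accuracy"]
    return {
        # Absolute accuracy difference.
        "accuracy_gain": round(system["accuracy"] - base["accuracy"], 2),
        # Fraction of the baseline's errors that the system removes.
        "error_reduction": round(1 - system_error / base_error, 2),
        # How many times slower the system is than the baseline.
        "latency_multiple": round(system["latency_s"] / base["latency_s"], 1),
    }


for name in ("L-MARS simple", "L-MARS multi-turn"):
    print(name, tradeoff(name))
```

Read this way, simple mode removes roughly two thirds of the baseline's errors at about 8 times the latency, while multi-turn removes a little over four fifths at about 33 times the latency.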

Who Should Care

What To Try In 7 Days

Run L-MARS simple mode on a small set of customer legal questions to measure accuracy vs latency.

Add CourtListener or other authoritative APIs for domain-critical questions to improve citation strength.

Implement a Judge-like checklist (authority, date, jurisdiction) to gate answer finalization and track rejections.
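
A Judge-like gate of the kind suggested in the last item could start as a plain rule check before any LLM is involved. The sketch below is an assumption-heavy illustration, not the paper's Judge Agent: the `Evidence` fields, the `Gate` class, and the rule that at least one source must pass all three criteria are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Evidence:
    source: str          # URL or citation string
    authoritative: bool  # e.g. statute, regulation, court opinion
    published: date
    jurisdiction: str


@dataclass
class Gate:
    jurisdiction: str
    not_before: date  # oldest publication date still accepted
    rejections: list = field(default_factory=list)

    def failed_criteria(self, item: Evidence) -> list:
        """Return the checklist items this piece of evidence fails."""
        failed = []
        if not item.authoritative:
            failed.append("authority")
        if item.published < self.not_before:
            failed.append("date")
        if item.jurisdiction != self.jurisdiction:
            failed.append("jurisdiction")
        return failed

    def check(self, question: str, evidence: list) -> bool:
        """Allow finalization only if some source passes every criterion."""
        per_item = [self.failed_criteria(item) for item in evidence]
        if any(not failed for failed in per_item):
            return True
        # Keep a record so rejection reasons can be tracked over time.
        self.rejections.append((question, per_item))
        return False


gate = Gate(jurisdiction="US-CA", not_before=date(2025, 1, 1))
blog = Evidence("example blog post", False, date(2023, 5, 1), "US-CA")
statute = Evidence("example statute text", True, date(2025, 3, 1), "US-CA")

first_try = gate.check("Is rule X in force?", [blog])
second_try = gate.check("Is rule X in force?", [blog, statute])
print(first_try, second_try, gate.rejections)
```

Logging the per-source failure reasons makes it possible to see whether rejections come mostly from stale sources, weak authority, or the wrong jurisdiction, which is the tracking the checklist item asks for.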

Agent Features

Memory

  • Centralized WorkflowState tracking query, accumulated results, iteration history

Planning

  • Query decomposition into clarifying sub-questions
  • Iterative search planning guided by missing-evidence directives

Tool Use

  • Web search via Serper API
  • CourtListener case law API
  • Local BM25 retriever
  • HTML/PDF scraping (BeautifulSoup, pdfplumber)
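
For readers unfamiliar with the local BM25 retriever listed above, the following is a minimal, dependency-free sketch of Okapi BM25 ranking. It is not the L-MARS implementation: the whitespace tokenizer, the `k1` and `b` defaults, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are assumptions chosen for brevity.

```python
import math
from collections import Counter


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        # Naive tokenization: lowercase and split on whitespace.
        self.docs = [d.lower().split() for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Length-normalized term-frequency saturation.
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        """Indices of the k highest-scoring documents."""
        order = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:k]


corpus = [
    "the tenant must receive 30 days notice before eviction",
    "overtime pay rules for hourly employees",
    "notice period for ending a lease",
]
index = BM25(corpus)
print(index.top_k("eviction notice", k=2))
```

Because scoring is purely lexical, a retriever like this is cheap to rebuild when documents are added, but it will miss sources that use different wording from the query.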

Frameworks

  • LangGraph

Is Agentic

true

Architectures

  • Directed Acyclic Graph (DAG) workflow
  • node-based agents (Query, Search, Judge, Summary)

Collaboration

  • Sequential agent hand-off with conditional routing
  • Judge Agent enforces stopping-rule across iterations
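
The hand-off and stopping rule described above can be sketched as a plain loop over a shared state object. This is a simplified illustration in standard Python rather than the LangGraph graph the system uses; the `run` function, the callable signatures, the `max_iters` cap, and the stub agents are all hypothetical. Only the `WorkflowState` name and its three tracked items (query, accumulated results, iteration history) come from the feature list.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    query: str
    results: list = field(default_factory=list)  # accumulated evidence
    history: list = field(default_factory=list)  # one entry per iteration


def run(query, decompose, search, judge, summarize, max_iters=3):
    """Query -> Search -> Judge, repeated until the Judge is satisfied."""
    state = WorkflowState(query)
    sub_queries = decompose(query)
    for i in range(max_iters):
        for q in sub_queries:
            state.results.extend(search(q))
        # The judge returns (sufficient?, follow-up queries for missing evidence).
        sufficient, missing = judge(state)
        state.history.append(
            {"iteration": i, "queries": list(sub_queries), "sufficient": sufficient}
        )
        if sufficient:
            break
        sub_queries = missing
    return summarize(state), state


# Stub agents standing in for LLM and retrieval calls.
def decompose(query):
    return ["a", "b"]


def search(q):
    return [q.upper()]


def judge(state):
    # Toy stopping rule: three pieces of evidence are enough.
    return len(state.results) >= 3, ["follow-up"]


def summarize(state):
    return " | ".join(state.results)


answer, state = run("example question", decompose, search, judge, summarize)
print(answer, len(state.history))
```

The `max_iters` cap is one simple way to bound the extra latency and cost that an over-conservative judge would otherwise add.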

Optimization Features

Token Efficiency

  • Token-bounded content extraction (2.5k-char windows, hard caps)
  • LoRA

System Optimization

  • Dynamic local index updates without restart
  • Deterministic, temperature=0 Judge Agent to reduce variance

Inference Optimization

  • Snippet-anchored extraction to limit context window
  • Basic vs. Enhanced search modes for latency/recall tradeoffs
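
One way to picture snippet-anchored, size-bounded extraction is to locate the search-result snippet inside the fetched page and keep only a fixed-size window of text around it. The sketch below assumes the snippet appears verbatim in the page and falls back to the start of the page otherwise; the function name `extract_window` and that fallback are illustrative assumptions, and real search snippets often need fuzzier matching.

```python
def extract_window(page_text, snippet, window=2500):
    """Return at most `window` characters of page_text centered on snippet."""
    idx = page_text.find(snippet)
    if idx == -1:
        # Assumed fallback: no verbatim match, so keep the head of the page.
        return page_text[:window]
    center = idx + len(snippet) // 2
    start = max(0, center - window // 2)
    end = min(len(page_text), start + window)
    # If the window hit the end of the page, slide it back to keep its size.
    start = max(0, end - window)
    return page_text[start:end]


snippet = "the statute took effect in 2025"
page = "x" * 5000 + snippet + "y" * 5000
context = extract_window(page, snippet)
print(len(context), snippet in context)
```

Bounding each source to a fixed window keeps the prompt size predictable as the number of retrieved pages grows, at the risk of cutting off relevant text that sits far from the snippet.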

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • System accuracy depends on retrieval quality; missing authorities still cause errors.
  • Multi-turn mode has high latency (≈56s) that limits real-time use.
  • Evaluation restricted to US-centric LegalSearchQA (200 items); cross-jurisdiction generality untested.
  • Judge Agent can be over-conservative and cause unnecessary search iterations.

When Not To Use

  • When sub-second or low-latency responses are required.
  • For non-US jurisdictions without integrated authoritative retrieval sources.
  • If no reliable web or legal database access is available.
  • When a compact, offline LLM without retrieval is mandated.

Failure Modes

  • Missed authoritative documents cause hallucinations despite the Judge checks.
  • Judge over-rejection increases latency and costs by triggering extra searches.
  • Models default to heuristic priors when retrieval returns ambiguous or partial evidence.
  • Confusing narrow statutory exceptions leads to misclassification.

Core Entities

Models

  • L-MARS
  • GPT-4o
  • Claude-4-Sonnet
  • Gemini-2.5-Flash
  • GPT-o3

Metrics

  • Accuracy
  • U-Score
  • LLM-as-Judge

Datasets

  • LegalSearchQA
  • LegalBench
  • LexGLUE
  • Pile-of-Law

Benchmarks

  • LegalSearchQA

Context Entities

Models

  • OpenAI o1
  • Qwen-QwQ
  • DeepSeek-R1