Multi-agent, retrieval-first system that cuts legal LLM hallucinations by iterating search, judge, and summary

August 31, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is practically useful for legal QA: it materially improves accuracy and grounding on the evaluated benchmark, but it requires engineering work to integrate retrieval APIs and comes with much higher latency.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Ziqi Wang, Boqin Yuan

Links

Abstract / PDF / Code

Why It Matters For Business

If legal accuracy and sourceable answers matter, invest in an agentic RAG pipeline with an evidence-checking judge; expect big accuracy and trust gains but higher latency and engineering integration costs.

Summary TLDR

L-MARS is a multi-agent pipeline for legal question answering that alternates query decomposition, targeted retrieval (web, local BM25, CourtListener case law), and a Judge Agent that checks evidence sufficiency before synthesizing answers. On a new 200-question LegalSearchQA benchmark, L-MARS raises multiple-choice accuracy from ~0.86–0.89 (pure LLMs) to 0.96 (simple mode) and 0.98 (multi-turn), reduces a rule-based uncertainty score (U-Score) from ~0.55–0.62 to ~0.39–0.42, but increases latency (13.6s simple, 55.7s multi-turn). Code is publicly linked. The system is practical when accuracy and citation grounding matter more than low latency.

Problem Statement

Legal questions need up-to-date, jurisdiction-specific evidence. Single-pass LLMs hallucinate or hedge when their training cutoff predates the relevant authority or their retrieval misses it. The paper asks: can iterative agentic search plus an evidence-checking Judge Agent reduce hallucination and uncertainty while grounding answers in authoritative law?

Main Contribution

L-MARS: a multi-agent workflow that interleaves query decomposition, targeted retrieval, judge-based sufficiency checks, and final summarization.

Agentic search using Serper (web), a local BM25 RAG index, and CourtListener for case law; snippet-anchored content extraction to limit context.
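The workflow above can be sketched as a loop. This is a minimal illustration, not the authors' implementation: the function names (`decompose`, `search`, `judge`, `summarize`), the `WorkflowState` fields, and the `max_iterations` cap are all assumptions, and the agents are stubbed with plain functions rather than LLM or API calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WorkflowState:
    """Hypothetical shared state: the query, gathered evidence, and history."""
    query: str
    evidence: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)


def run_pipeline(
    query: str,
    decompose: Callable[[str, str], List[str]],
    search: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], Tuple[bool, str]],
    summarize: Callable[[str, List[str]], str],
    max_iterations: int = 3,  # assumed cap, not a value from the paper
) -> Tuple[str, WorkflowState]:
    state = WorkflowState(query=query)
    missing = ""  # judge feedback on what evidence is still missing
    for i in range(max_iterations):
        # 1. Turn the question (plus any judge feedback) into sub-queries.
        for sub_query in decompose(state.query, missing):
            # 2. Retrieve evidence for each sub-query.
            state.evidence.extend(search(sub_query))
        # 3. Ask the judge whether the evidence is sufficient.
        sufficient, missing = judge(state.query, state.evidence)
        state.history.append(f"iteration {i + 1}: sufficient={sufficient}")
        if sufficient:
            break
    # 4. Synthesize the final answer from whatever evidence was gathered.
    return summarize(state.query, state.evidence), state


# Stub agents for demonstration only.
def demo_decompose(query, missing):
    return [missing] if missing else [query]

def demo_search(sub_query):
    return [f"doc about {sub_query}"]

def demo_judge(query, evidence):
    ok = len(evidence) >= 2
    return ok, ("" if ok else "statute text")

def demo_summarize(query, evidence):
    return f"{query} -> answered from {len(evidence)} sources"

answer, final_state = run_pipeline(
    "Is a verbal lease enforceable?",
    demo_decompose, demo_search, demo_judge, demo_summarize,
)
print(answer)
print(final_state.history)
```

With these stubs the judge rejects the first round, asks for "statute text", and accepts after the second round. A "simple mode" in this sketch would correspond to `max_iterations=1`; whether the paper's simple mode works that way is not stated in this summary.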

Key Findings

Multi-turn L-MARS accuracy on LegalSearchQA

Numbers: Accuracy 0.98 vs GPT-4o 0.89 on 200 questions

Practical Use: Use multi-turn mode when correctness is critical: expect an ~9 percentage-point accuracy lift on the evaluated legal QA set at the cost of extra latency.

Evidence Ref: Table 3, §4.4

L-MARS reduces model uncertainty by the U-Score metric

Numbers: U-Score drops from 0.55–0.62 to 0.39–0.42

Practical Use: Iterative retrieval plus a Judge Agent yields clearer, better-cited answers; use it to lower hedging and vague time/jurisdiction claims in legal outputs.

Evidence Ref: Table 3, §4.4

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | L-MARS multi-turn 0.98; simple 0.96; GPT-4o 0.89; Claude 0.88; Gemini 0.86 | GPT-4o 0.89 | +0.09 (multi-turn vs GPT-4o) | LegalSearchQA (n=200) | Table 3, §4.4 | Table 3 |
| U-Score (lower is better) | L-MARS multi-turn 0.39; simple 0.42; GPT-4o 0.55; Claude 0.62; Gemini 0.58 | GPT-4o 0.55 | -0.16 (multi-turn vs GPT-4o) | LegalSearchQA (n=200) | Table 3, §4.4 | Table 3 |
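The deltas in these results can be re-derived from the reported values. The short check below uses only numbers stated in this summary; the conversion to a question count assumes accuracy is simply correct answers divided by the 200 questions.

```python
# Values reported in this summary (LegalSearchQA, n=200).
n_questions = 200
accuracy = {"L-MARS multi-turn": 0.98, "L-MARS simple": 0.96, "GPT-4o": 0.89}
u_score = {"L-MARS multi-turn": 0.39, "L-MARS simple": 0.42, "GPT-4o": 0.55}
latency_s = {"L-MARS simple": 13.6, "L-MARS multi-turn": 55.7}

# Deltas of multi-turn L-MARS against the GPT-4o baseline.
acc_delta = round(accuracy["L-MARS multi-turn"] - accuracy["GPT-4o"], 2)
u_delta = round(u_score["L-MARS multi-turn"] - u_score["GPT-4o"], 2)

# Assumption: accuracy = correct answers / n_questions.
extra_correct = round(
    (accuracy["L-MARS multi-turn"] - accuracy["GPT-4o"]) * n_questions
)

# How much slower multi-turn mode is than simple mode.
latency_ratio = round(
    latency_s["L-MARS multi-turn"] / latency_s["L-MARS simple"], 1
)

print(acc_delta, u_delta, extra_correct, latency_ratio)
# prints: 0.09 -0.16 18 4.1
```

Under that assumption, the 9-point lift is about 18 more questions answered correctly out of 200, bought with roughly four times the latency of simple mode.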

What To Try In 7 Days

Run L-MARS simple mode on a small set of customer legal questions to measure accuracy vs latency.

Add CourtListener or other authoritative APIs for domain-critical questions to improve citation strength.

Implement a Judge-like checklist (authority, date, jurisdiction) to gate answer finalization and track rejections.
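The Judge-like checklist in the last item could start as a rule-based gate before any LLM judge is involved. The sketch below is one possible shape, not the paper's Judge Agent: the `Evidence` fields, the `AUTHORITATIVE` set, and the five-year recency threshold are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Evidence:
    """Hypothetical evidence record; the field names are illustrative."""
    source_type: str             # e.g. "statute", "case_law", "blog"
    published: Optional[date]    # None when the date is unknown
    jurisdiction: Optional[str]  # e.g. "CA"


# Assumed list of source types treated as authoritative.
AUTHORITATIVE = {"statute", "regulation", "case_law", "government_site"}


def checklist(evidence: List[Evidence], jurisdiction: str,
              today: date, max_age_years: int = 5) -> Dict[str, bool]:
    """Return one pass/fail flag per check; all must pass to finalize."""
    return {
        "authority": any(e.source_type in AUTHORITATIVE for e in evidence),
        "date": any(
            e.published is not None
            and (today - e.published).days <= max_age_years * 365
            for e in evidence
        ),
        "jurisdiction": any(e.jurisdiction == jurisdiction for e in evidence),
    }


rejections: List[Dict[str, bool]] = []  # log of failed checks for later review


def gate(evidence: List[Evidence], jurisdiction: str, today: date) -> bool:
    """Allow finalization only if every check passes; record rejections."""
    result = checklist(evidence, jurisdiction, today)
    passed = all(result.values())
    if not passed:
        rejections.append(result)
    return passed


weak = [Evidence("blog", None, "CA")]
strong = weak + [Evidence("statute", date(2024, 1, 15), "CA")]
today = date(2025, 8, 31)
print(gate(weak, "CA", today), gate(strong, "CA", today), len(rejections))
# prints: False True 1
```

Logging which check failed, rather than only the pass/fail result, makes it possible to see whether rejections cluster on authority, date, or jurisdiction.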

Agent Features

Memory
Centralized WorkflowState tracking query, accumulated results, iteration history
Planning
Query decomposition into clarifying sub-questions; iterative search planning guided by missing-evidence directives
Tool Use
Web search via Serper API; CourtListener case law API; local BM25 retriever; HTML/PDF scraping (BeautifulSoup, pdfplumber)
Frameworks
LangGraph
Is Agentic

Yes

Architectures
Directed Acyclic Graph (DAG) workflow; node-based agents (Query, Search, Judge, Summary)
Collaboration
Sequential agent hand-off with conditional routing; Judge Agent enforces stopping rule across iterations
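The local BM25 retriever listed under Tool Use ranks documents by query-term overlap, weighted by how rare each term is and normalized by document length. Below is a minimal sketch of Okapi BM25 scoring in plain Python. The whitespace tokenizer and the parameters `k1=1.5` and `b=0.75` are common defaults chosen for illustration, not values taken from the paper, and the sample documents are invented.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "tenant security deposit must be returned within 21 days",
    "speed limits on state highways",
    "landlord may deduct unpaid rent from the security deposit",
]
scores = bm25_scores("when must a security deposit be returned", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The first document ranks highest because it matches five query terms, the third matches two, and the second matches none and scores zero. The function expects a non-empty corpus with at least one non-empty document, since it divides by the average document length.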

Optimization Features

Token Efficiency
Token-bounded content extraction (2.5k-char windows, hard caps); LoRA
System Optimization
Dynamic local index updates without restart; deterministic, temperature=0 Judge Agent to reduce variance
Inference Optimization
Snippet-anchored extraction to limit context window; Basic vs. Enhanced search modes for latency/recall tradeoffs
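Snippet-anchored extraction keeps only the page text around the search snippet instead of passing the whole page to the model. The sketch below shows one way this could work. The 2,500-character window comes from this summary; the centering logic, the fallback to the start of the page, and the name `extract_window` are assumptions.

```python
def extract_window(page_text: str, snippet: str, window: int = 2500) -> str:
    """Return at most `window` characters of page_text centered on the snippet.

    Falls back to the first `window` characters when the snippet is not found.
    """
    pos = page_text.find(snippet)
    if pos == -1:
        return page_text[:window]
    center = pos + len(snippet) // 2
    start = max(0, center - window // 2)
    end = min(len(page_text), start + window)
    # Shift the window left if it ran into the end of the page.
    start = max(0, end - window)
    return page_text[start:end]


# Synthetic page: the snippet sits in the middle of 10,000 filler characters.
snippet = "A tenant may recover the deposit."
page = "x" * 5000 + snippet + "y" * 5000
chunk = extract_window(page, snippet)
print(len(chunk), snippet in chunk)
# prints: 2500 True
```

This version matches the snippet exactly. Search-engine snippets are often truncated or contain ellipses, so a real pipeline would likely need fuzzy matching before anchoring the window.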

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

System accuracy depends on retrieval quality; missing authorities still cause errors.

Multi-turn mode has high latency (≈56s) that limits real-time use.

When Not To Use

When sub-second or low-latency responses are required.

For non-US jurisdictions without integrated authoritative retrieval sources.

Failure Modes

Missed authoritative documents cause hallucinations despite the Judge checks.

Judge over-rejection increases latency and costs by triggering extra searches.

Core Entities

Models

L-MARS, GPT-4o, Claude-4-Sonnet, Gemini-2.5-Flash, GPT-o3

Metrics

Accuracy, U-Score, LLM-as-Judge

Datasets

LegalSearchQA, LegalBench, LexGLUE, Pile-of-Law

Benchmarks

LegalSearchQA

Context Entities

Models

OpenAI o1, Qwen-QwQ, DeepSeek-R1