Multi-agent, retrieval-first system that cuts legal LLM hallucinations by iterating search, judge, and summary

Overview

Decision Snapshot: Needs Validation

The approach is practically useful for legal QA: it materially improves accuracy and grounding on the evaluated benchmark, but it requires engineering work to integrate retrieval APIs and incurs much higher latency.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Ziqi Wang, Boqin Yuan

Links

Abstract / PDF / Code

Why It Matters For Business

If legal accuracy and sourceable answers matter, invest in an agentic RAG pipeline with an evidence-checking judge; expect big accuracy and trust gains but higher latency and engineering integration costs.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist

Summary TLDR

L-MARS is a multi-agent pipeline for legal question answering that alternates query decomposition, targeted retrieval (web, local BM25, CourtListener case law), and a Judge Agent that checks evidence sufficiency before an answer is synthesized. On a new 200-question LegalSearchQA benchmark, L-MARS raises multiple-choice accuracy from ~0.86–0.89 (pure LLMs) to 0.96 (simple mode) and 0.98 (multi-turn), and lowers a rule-based uncertainty score (U-Score) from ~0.55–0.62 to ~0.39–0.42. The cost is latency: 13.6s in simple mode and 55.7s in multi-turn mode. Code is publicly linked. The system is practical when accuracy and citation grounding matter more than low latency.

Problem Statement

Legal questions need up-to-date, jurisdiction-specific evidence. Single-pass LLMs hallucinate or hedge when authoritative sources postdate their training cutoff or are missed by retrieval. The paper asks: can iterative agentic search plus an evidence-checking Judge Agent reduce hallucination and uncertainty while grounding answers in authoritative law?

Main Contribution

L-MARS: a multi-agent workflow that interleaves query decomposition, targeted retrieval, judge-based sufficiency checks, and final summarization.

Agentic search using Serper (web), a local BM25 RAG index, and CourtListener for case law; snippet-anchored content extraction to limit context.

Key Findings

Multi-turn L-MARS accuracy on LegalSearchQA

Numbers: Accuracy 0.98 vs GPT-4o 0.89 on 200 questions

Practical Use: Use multi-turn mode when correctness is critical: expect an ~9 percentage-point accuracy lift on the evaluated legal QA set at the cost of extra latency.

Evidence Ref: Table 3, §4.4

L-MARS reduces model uncertainty by the U-Score metric

Numbers: U-Score drops from 0.55–0.62 to 0.39–0.42

Practical Use: Iterative retrieval plus a Judge Agent yields clearer, better-cited answers; use it to lower hedging and vague time/jurisdiction claims in legal outputs.

Evidence Ref: Table 3, §4.4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	L-MARS multi-turn 0.98; simple 0.96; GPT-4o 0.89; Claude 0.88; Gemini 0.86	GPT-4o 0.89	+0.09 (multi-turn vs GPT-4o)	LegalSearchQA (n=200)	Table 3, §4.4	Table 3
U-Score (lower is better)	L-MARS multi-turn 0.39; simple 0.42; GPT-4o 0.55; Claude 0.62; Gemini 0.58	GPT-4o 0.55	-0.16 (multi-turn vs GPT-4o)	LegalSearchQA (n=200)	Table 3, §4.4	Table 3
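
The deltas in the table can be re-derived from the reported values. The following is a small arithmetic check that uses only the numbers in the table; the dictionary and function names are illustrative and do not come from the paper or its code.

```python
# Values copied from the Results table (LegalSearchQA, n=200).
accuracy = {"L-MARS multi-turn": 0.98, "L-MARS simple": 0.96,
            "GPT-4o": 0.89, "Claude": 0.88, "Gemini": 0.86}
u_score = {"L-MARS multi-turn": 0.39, "L-MARS simple": 0.42,
           "GPT-4o": 0.55, "Claude": 0.62, "Gemini": 0.58}


def delta(table, system, baseline="GPT-4o"):
    """Difference between a system and the baseline, rounded to 2 decimals."""
    return round(table[system] - table[baseline], 2)


acc_delta = delta(accuracy, "L-MARS multi-turn")  # accuracy gain over GPT-4o
u_delta = delta(u_score, "L-MARS multi-turn")     # U-Score change (negative is better)
print(acc_delta, u_delta)
```

If the reported accuracies are exact, a 0.09 gain on 200 questions corresponds to about 18 additional correct answers.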

What To Try In 7 Days

Run L-MARS simple mode on a small set of customer legal questions to measure accuracy vs latency.

Add CourtListener or other authoritative APIs for domain-critical questions to improve citation strength.

Implement a Judge-like checklist (authority, date, jurisdiction) to gate answer finalization and track rejections.
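
A Judge-like gate of this kind could be prototyped with plain rules before any LLM is involved. The sketch below is an assumption-laden illustration, not the paper's Judge Agent: the `Evidence` fields, the 730-day recency threshold, and the `gate` and `rejection_log` names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Evidence:
    source: str                  # URL or citation string
    is_authoritative: bool       # e.g. statute, regulation, court opinion
    published: Optional[date]    # None when the date could not be determined
    jurisdiction: Optional[str]  # e.g. "US-CA"; None when unknown


def judge_checklist(evidence, target_jurisdiction, max_age_days=730, today=None):
    """Return (passed, reasons). The threshold is a placeholder, not a paper value."""
    today = today or date.today()
    reasons = []
    if not any(e.is_authoritative for e in evidence):
        reasons.append("no authoritative source")
    if not any(e.published and (today - e.published).days <= max_age_days
               for e in evidence):
        reasons.append("no recent dated source")
    if not any(e.jurisdiction == target_jurisdiction for e in evidence):
        reasons.append("jurisdiction mismatch")
    return (not reasons, reasons)


rejection_log = []  # kept so you can see which check fails most often


def gate(question, evidence, target_jurisdiction, today=None):
    """Allow finalization only when every checklist item passes."""
    passed, reasons = judge_checklist(evidence, target_jurisdiction, today=today)
    if not passed:
        rejection_log.append({"question": question, "reasons": reasons})
    return passed
```

Counting the entries in `rejection_log` by reason gives a quick view of whether authority, date, or jurisdiction is the usual blocker.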

Agent Features

Memory

Centralized WorkflowState tracking query, accumulated results, iteration history
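
One way to picture such a centralized state object is a small dataclass that every agent reads and updates. This is a minimal sketch: only the three tracked items (query, accumulated results, iteration history) come from the description above, while the field names and the `record_iteration` helper are assumptions and may not match the released code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkflowState:
    query: str                                                    # original user question
    results: list[dict[str, Any]] = field(default_factory=list)   # accumulated evidence
    history: list[dict[str, Any]] = field(default_factory=list)   # one entry per iteration

    def record_iteration(self, sub_queries, new_results, judge_verdict):
        """Append new evidence and log what happened in this search round."""
        self.results.extend(new_results)
        self.history.append({
            "iteration": len(self.history) + 1,
            "sub_queries": list(sub_queries),
            "n_new_results": len(new_results),
            "judge_verdict": judge_verdict,
        })
```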

Planning

Query decomposition into clarifying sub-questions

Iterative search planning guided by missing-evidence directives

Tool Use

Web search via Serper API

CourtListener case law API

Local BM25 retriever

HTML/PDF scraping (BeautifulSoup, pdfplumber)
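
For readers unfamiliar with BM25, the local retriever can be approximated with a few lines of standard-library Python. This sketch implements textbook Okapi BM25 with a smoothed IDF; the whitespace tokenizer, the `k1=1.5` and `b=0.75` defaults, and the class name are assumptions, and the system's actual retriever may be built differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for illustration)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Smoothed IDF; the "1 +" inside the log keeps every weight positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        """Indices of the k highest-scoring documents."""
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

Lexical scoring like this is cheap and needs no embedding model, which is presumably why it suits a local index that is updated while the system runs.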

Frameworks

LangGraph

Is Agentic

Yes

Architectures

Directed Acyclic Graph (DAG) workflow

Node-based agents (Query, Search, Judge, Summary)

Collaboration

Sequential agent hand-off with conditional routing

Judge Agent enforces stopping-rule across iterations
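
The hand-off and stopping rule described above can be sketched as a plain loop, independent of any framework. Everything here is illustrative: the four callables stand in for the Query, Search, Judge, and Summary agents, the verdict dictionary shape is an assumption, and the `max_iterations` cap is a hypothetical safeguard rather than a documented setting.

```python
def run_workflow(question, decompose, search, judge, summarize, max_iterations=3):
    """Query -> Search -> Judge, looping until the Judge is satisfied or the cap is hit."""
    evidence, history = [], []
    queries = decompose(question)
    for _ in range(max_iterations):
        evidence.extend(search(queries))
        # Assumed verdict shape: {"sufficient": bool, "missing": [str, ...]}
        verdict = judge(question, evidence)
        history.append(verdict)
        if verdict["sufficient"]:
            break  # stopping rule: the Judge accepts the evidence
        # Conditional routing: search again, guided by what the Judge says is missing.
        queries = verdict["missing"] or queries
    return summarize(question, evidence), history
```

A hard iteration cap bounds the extra latency and cost that a Judge which rejects too often would otherwise cause.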

Optimization Features

Token Efficiency

Token-bounded content extraction (2.5k-char windows, hard caps)

LoRA
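
Snippet-anchored, bounded extraction can be illustrated as follows: locate the search-result snippet inside the scraped page and keep only a fixed-size window around it. The 2,500-character window comes from the description above; exact-substring matching, the head-of-page fallback, and the function name are assumptions made for this sketch.

```python
def extract_window(page_text, snippet, window_chars=2500):
    """Return at most window_chars characters of page_text centered on snippet."""
    pos = page_text.find(snippet)
    if pos == -1:
        # Snippet not found verbatim: fall back to the start of the page.
        return page_text[:window_chars]
    center = pos + len(snippet) // 2
    start = max(0, center - window_chars // 2)
    end = min(len(page_text), start + window_chars)
    # If the window was clipped at the end of the page, extend it to the left.
    start = max(0, end - window_chars)
    return page_text[start:end]
```

Capping the text per source keeps the prompt size predictable regardless of how long the scraped page is.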

System Optimization

Dynamic local index updates without restart

Deterministic, temperature=0 Judge Agent to reduce variance

Inference Optimization

Snippet-anchored extraction to limit context window

Basic vs. Enhanced search modes for latency/recall tradeoffs

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/boqiny/LMARS

Risks & Boundaries

Limitations

System accuracy depends on retrieval quality; missing authorities still cause errors.

Multi-turn mode has high latency (≈56s) that limits real-time use.

When Not To Use

When sub-second or low-latency responses are required.

For non-US jurisdictions without integrated authoritative retrieval sources.

Failure Modes

Missed authoritative documents cause hallucinations despite the Judge checks.

Judge over-rejection increases latency and costs by triggering extra searches.

Core Entities

Models

L-MARS, GPT-4o, Claude-4-Sonnet, Gemini-2.5-Flash, GPT-o3

Metrics

Accuracy, U-Score, LLM-as-Judge

Datasets

LegalSearchQA, LegalBench, LexGLUE, Pile-of-Law

Benchmarks

LegalSearchQA

Context Entities

Models

OpenAI o1, Qwen-QwQ, DeepSeekR1

You May Also Want to Read

Argues that 'agentic' buzzwords mostly rebrand decades-old agent and multi-agent research

Create, customize, and run multi-step LLM agents from plain language — no code needed

COMPASS: a multi-agent orchestration that uses RAG and an LLM-as-judge to enforce sovereignty, carbon-awareness, compliance, and ethics in real time

RAPS: intent-driven, reputation-aware publish–subscribe for adaptive multi-agent LLM coordination

ACP: a layered, federated protocol for secure cross-platform agent-to-agent collaboration
