Overview
The approach is practically useful for legal QA: it materially improves accuracy and grounding on the evaluated benchmark, but it requires engineering work to integrate retrieval APIs and incurs much higher latency.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If legal accuracy and sourceable answers matter, invest in an agentic RAG pipeline with an evidence-checking judge; expect big accuracy and trust gains but higher latency and engineering integration costs.
Who Should Care
Summary TLDR
L-MARS is a multi-agent pipeline for legal question answering that alternates query decomposition, targeted retrieval (web, local BM25, CourtListener case law), and a Judge Agent that checks evidence sufficiency before synthesizing answers. On a new 200-question LegalSearchQA benchmark, L-MARS raises multiple-choice accuracy from ~0.86–0.89 (pure LLMs) to 0.96 (simple mode) and 0.98 (multi-turn), reduces a rule-based uncertainty score (U-Score) from ~0.55–0.62 to ~0.39–0.42, but increases latency (13.6s simple, 55.7s multi-turn). Code is publicly linked. The system is practical when accuracy and citation grounding matter more than low latency.
Problem Statement
Legal questions need up-to-date, jurisdiction-specific evidence. Single-pass LLMs hallucinate or hedge when their training data predates the relevant law or when retrieval misses authoritative sources. The paper asks: can iterative agentic search plus an evidence-checking Judge Agent reduce hallucination and uncertainty while grounding answers in authoritative law?
Main Contribution
L-MARS: a multi-agent workflow that interleaves query decomposition, targeted retrieval, judge-based sufficiency checks, and final summarization.
Agentic search using Serper (web), a local BM25 RAG index, and CourtListener for case law; snippet-anchored content extraction to limit context.
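
The interleaved workflow described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the names `run_pipeline`, `Evidence`, `RunResult`, and `max_turns` are hypothetical, and the four agents are passed in as plain callables so that no real retrieval API (Serper, BM25, CourtListener) is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    source: str   # hypothetical label, e.g. "web", "bm25", "courtlistener"
    snippet: str  # short extracted passage, kept small to limit context


@dataclass
class RunResult:
    answer: str
    evidence: List[Evidence] = field(default_factory=list)
    turns: int = 0


def run_pipeline(
    question: str,
    decompose: Callable[[str, List[Evidence]], List[str]],
    retrieve: Callable[[str], List[Evidence]],
    judge: Callable[[str, List[Evidence]], bool],
    summarize: Callable[[str, List[Evidence]], str],
    max_turns: int = 3,  # assumed cap; the source does not state one
) -> RunResult:
    """Alternate decomposition and retrieval until the judge accepts the evidence."""
    evidence: List[Evidence] = []
    turns = 0
    for turns in range(1, max_turns + 1):
        # Decompose the question into sub-queries, given what is already known.
        for query in decompose(question, evidence):
            evidence.extend(retrieve(query))
        # The judge gates finalization: stop searching once evidence suffices.
        if judge(question, evidence):
            break
    return RunResult(summarize(question, evidence), evidence, turns)


def demo() -> RunResult:
    """Run the loop with stub agents; the judge wants at least two snippets."""
    return run_pipeline(
        "Is a verbal lease enforceable?",
        decompose=lambda q, ev: [f"{q} (follow-up {len(ev)})"],
        retrieve=lambda query: [Evidence("bm25", f"snippet for: {query}")],
        judge=lambda q, ev: len(ev) >= 2,
        summarize=lambda q, ev: f"{len(ev)} snippets support an answer to: {q}",
    )
```

With these stubs the judge rejects the first turn and accepts the second. Setting `max_turns=1` approximates a single-pass "simple" mode, although the source does not specify how the two modes differ internally.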
Key Findings
Multi-turn L-MARS reaches 0.98 accuracy on LegalSearchQA, versus 0.89 for GPT-4o (+0.09).
L-MARS lowers uncertainty as measured by U-Score, from 0.55 for GPT-4o to 0.39 in multi-turn mode (-0.16).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | L-MARS multi-turn 0.98; simple 0.96; GPT-4o 0.89; Claude 0.88; Gemini 0.86 | GPT-4o 0.89 | +0.09 (multi-turn vs GPT-4o) | LegalSearchQA (n=200) | Table 3, §4.4 | Table 3 |
| U-Score (lower is better) | L-MARS multi-turn 0.39; simple 0.42; GPT-4o 0.55; Claude 0.62; Gemini 0.58 | GPT-4o 0.55 | -0.16 (multi-turn vs GPT-4o) | LegalSearchQA (n=200) | Table 3, §4.4 | Table 3 |
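
The Delta column can be reproduced directly from the values in the table. The short check below does that arithmetic and also computes the latency ratio from the figures quoted in the summary (13.6s simple, 55.7s multi-turn). The dictionary layout and the function names are illustrative only.

```python
# Values copied from the results table and the summary above.
RESULTS = {
    "accuracy": {"multi-turn": 0.98, "simple": 0.96, "GPT-4o": 0.89, "Claude": 0.88, "Gemini": 0.86},
    "u_score": {"multi-turn": 0.39, "simple": 0.42, "GPT-4o": 0.55, "Claude": 0.62, "Gemini": 0.58},
}
LATENCY_SECONDS = {"simple": 13.6, "multi-turn": 55.7}


def delta_vs_baseline(metric: str, system: str, baseline: str = "GPT-4o") -> float:
    """Return system minus baseline, rounded to two decimals."""
    return round(RESULTS[metric][system] - RESULTS[metric][baseline], 2)


def latency_ratio() -> float:
    """Return how many times slower multi-turn is than simple mode."""
    return round(LATENCY_SECONDS["multi-turn"] / LATENCY_SECONDS["simple"], 1)


print(delta_vs_baseline("accuracy", "multi-turn"))  # 0.09
print(delta_vs_baseline("u_score", "multi-turn"))   # -0.16
print(latency_ratio())                              # 4.1
```

Read together, the last two accuracy points over simple mode (0.96 to 0.98) cost roughly four times the latency, which is the trade-off to weigh when choosing a mode.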
What To Try In 7 Days
Run L-MARS simple mode on a small set of customer legal questions to measure accuracy vs latency.
Add CourtListener or other authoritative APIs for domain-critical questions to improve citation strength.
Implement a Judge-like checklist (authority, date, jurisdiction) to gate answer finalization and track rejections.
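
The checklist idea in the last item could be prototyped as a rule-based gate before any LLM judge is involved. The sketch below is an assumption-heavy illustration: the `Source` fields, the five-year freshness window, and the function names are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Source:
    title: str
    is_primary_authority: bool   # e.g. statute, regulation, or court opinion
    published: Optional[date]    # None when the date could not be extracted
    jurisdiction: Optional[str]  # e.g. "US-CA"; None when unknown


def checklist_failures(
    sources: List[Source],
    jurisdiction: str,
    today: date,
    max_age_days: int = 5 * 365,  # assumed freshness window, tune per domain
) -> List[str]:
    """Return the names of the checks that no retrieved source satisfies."""
    failures = []
    if not any(s.is_primary_authority for s in sources):
        failures.append("authority")
    if not any(
        s.published is not None and (today - s.published).days <= max_age_days
        for s in sources
    ):
        failures.append("date")
    if not any(s.jurisdiction == jurisdiction for s in sources):
        failures.append("jurisdiction")
    return failures


# Running tally of why answers were held back, for later review.
rejection_log: Counter = Counter()


def gate(sources: List[Source], jurisdiction: str, today: date) -> bool:
    """Return True if the answer may be finalized; otherwise record the reasons."""
    failures = checklist_failures(sources, jurisdiction, today)
    rejection_log.update(failures)
    return not failures
```

Tracking `rejection_log` over a trial week shows which check blocks answers most often, which in turn suggests whether adding a retrieval source or loosening a threshold is the better fix.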
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Reproducibility
Code URLs
Risks & Boundaries
Limitations
System accuracy depends on retrieval quality; missing authorities still cause errors.
Multi-turn mode has high latency (≈56s) that limits real-time use.
When Not To Use
When sub-second or low-latency responses are required.
For non-US jurisdictions without integrated authoritative retrieval sources.
Failure Modes
Missed authoritative documents cause hallucinations despite the Judge checks.
Judge over-rejection increases latency and costs by triggering extra searches.

