Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If legal accuracy and sourceable answers matter, invest in an agentic RAG pipeline with an evidence-checking judge. Expect higher accuracy and better-grounded citations, at the cost of higher latency and more engineering integration work.
Summary TLDR
L-MARS is a multi-agent pipeline for legal question answering that alternates query decomposition, targeted retrieval (web, local BM25, CourtListener case law), and a Judge Agent that checks evidence sufficiency before synthesizing answers. On a new 200-question LegalSearchQA benchmark, L-MARS raises multiple-choice accuracy from ~0.86–0.89 (pure LLMs) to 0.96 (simple mode) and 0.98 (multi-turn), reduces a rule-based uncertainty score (U-Score) from ~0.55–0.62 to ~0.39–0.42, but increases latency (13.6s simple, 55.7s multi-turn). Code is publicly linked. The system is practical when accuracy and citation grounding matter more than low latency.
Problem Statement
Legal questions need up-to-date, jurisdiction-specific evidence. Single-pass LLMs hallucinate or hedge when the relevant law postdates their training cutoff or when retrieval misses authoritative sources. The paper asks: can iterative agentic search plus an evidence-checking Judge Agent reduce hallucination and uncertainty while grounding answers in authoritative law?
Main Contribution
L-MARS: a multi-agent workflow that interleaves query decomposition, targeted retrieval, judge-based sufficiency checks, and final summarization.
Agentic search using Serper (web), a local BM25 RAG index, and CourtListener for case law; snippet-anchored content extraction to limit context.
LegalSearchQA: a 200-question benchmark focused on legal facts that postdate model training (as of 2025), with three evaluation metrics: Accuracy, U-Score (uncertainty), and LLM-as-Judge.
Empirical results showing large accuracy and uncertainty improvements versus strong LLM baselines, and a public code repository.
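The workflow described above (decompose, retrieve, judge, then either loop or summarize) can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the field names in WorkflowState, the function signatures, the max_iters cap, and all toy_* stubs are assumptions made for the example. In a real system the stubs would be replaced by LLM and retrieval calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WorkflowState:
    # Field names are hypothetical; the summary only says the state tracks
    # the query, accumulated results, and iteration history.
    question: str
    evidence: List[str] = field(default_factory=list)
    history: List[List[str]] = field(default_factory=list)


def run_pipeline(
    question: str,
    decompose: Callable[[str, str], List[str]],
    search: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], Tuple[bool, str]],
    summarize: Callable[[str, List[str]], str],
    max_iters: int = 3,  # assumed safety cap on search rounds
) -> Tuple[str, WorkflowState]:
    state = WorkflowState(question)
    missing = ""  # what the judge said was missing in the previous round
    for _ in range(max_iters):
        sub_queries = decompose(state.question, missing)
        state.history.append(sub_queries)
        for q in sub_queries:
            state.evidence.extend(search(q))
        sufficient, missing = judge(state.question, state.evidence)
        if sufficient:
            break  # stopping rule: the judge accepts the evidence
    return summarize(state.question, state.evidence), state


# Toy stand-ins so the sketch runs without any API access.
def toy_decompose(question: str, missing: str) -> List[str]:
    return [missing] if missing else [question]


def toy_search(query: str) -> List[str]:
    return [f"snippet for: {query}"]


def toy_judge(question: str, evidence: List[str]) -> Tuple[bool, str]:
    if len(evidence) >= 2:
        return True, ""
    return False, "effective date of the rule"


def toy_summarize(question: str, evidence: List[str]) -> str:
    return f"Answer based on {len(evidence)} snippets"


answer, state = run_pipeline(
    "Is the new rule in force?", toy_decompose, toy_search, toy_judge, toy_summarize
)
print(answer)  # Answer based on 2 snippets
```

With these stubs the judge rejects the first round, the second round searches for the missing item, and the loop stops after two iterations.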
Key Findings
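Multi-turn L-MARS accuracy on LegalSearchQA: 0.98 (simple mode 0.96; pure LLM baselines ~0.86–0.89)
L-MARS reduces model uncertainty by the U-Score metric: from ~0.55–0.62 to ~0.39–0.42
Iteration increases latency: 13.6s in simple mode, 55.7s in multi-turn mode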
Judge Agent can be over-conservative
Results
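Accuracy: ~0.86–0.89 (pure LLMs), 0.96 (L-MARS simple), 0.98 (L-MARS multi-turn)
U-Score (lower is better): ~0.55–0.62 (pure LLMs), ~0.39–0.42 (L-MARS)
Response Time (s): 13.6 (L-MARS simple), 55.7 (L-MARS multi-turn)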
Who Should Care
What To Try In 7 Days
Run L-MARS simple mode on a small set of customer legal questions to measure accuracy vs latency.
Add CourtListener or other authoritative APIs for domain-critical questions to improve citation strength.
Implement a Judge-like checklist (authority, date, jurisdiction) to gate answer finalization and track rejections.
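A Judge-like checklist of the kind suggested in the last item can start as plain rules before any LLM is involved. The sketch below is an assumption-laden illustration: the Evidence fields, the AUTHORITATIVE_SUFFIXES list, and the rule that any single item may satisfy each check are all hypothetical choices, not the paper's Judge Agent.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Evidence:
    # Hypothetical record shape for one retrieved source.
    source_domain: str
    published: Optional[date]
    jurisdiction: Optional[str]


# Assumed allow-list; a real deployment would maintain its own.
AUTHORITATIVE_SUFFIXES = (".gov", "courtlistener.com")

rejections: Counter = Counter()  # tracks why answers were held back


def checklist(evidence: List[Evidence], target_jurisdiction: str,
              not_before: date) -> List[str]:
    """Return the names of failed checks; an empty list means finalize."""
    failures = []
    if not any(e.source_domain.endswith(AUTHORITATIVE_SUFFIXES) for e in evidence):
        failures.append("authority")
    if not any(e.published is not None and e.published >= not_before
               for e in evidence):
        failures.append("date")
    if not any(e.jurisdiction == target_jurisdiction for e in evidence):
        failures.append("jurisdiction")
    rejections.update(failures)
    return failures


blog = Evidence("example-blog.com", date(2023, 1, 1), None)
official = Evidence("ca.gov", date(2025, 3, 1), "US-CA")

first = checklist([blog], "US-CA", date(2025, 1, 1))
second = checklist([blog, official], "US-CA", date(2025, 1, 1))
```

Here first lists all three failed checks and second is empty, so only the second evidence set would be allowed through. The rejections counter gives a simple way to see which check blocks answers most often.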
Agent Features
Memory
- Centralized WorkflowState tracking query, accumulated results, iteration history
Planning
- Query decomposition into clarifying sub-questions
- Iterative search planning guided by missing-evidence directives
Tool Use
- Web search via Serper API
- CourtListener case law API
- Local BM25 retriever
- HTML/PDF scraping (BeautifulSoup, pdfplumber)
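For the local BM25 retriever, a dependency-free sketch of Okapi BM25 scoring is shown below. It is not the system's retriever: the class name BM25Index, the whitespace tokenizer, and the parameter defaults k1=1.5 and b=0.75 are assumptions chosen for illustration. The add method shows one way an index could accept new documents without a restart.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class BM25Index:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs: List[str] = []
        self.tfs: List[Counter] = []   # per-document term frequencies
        self.df: Counter = Counter()   # document frequency per term
        self.total_len = 0
        for doc in docs:
            self.add(doc)

    def add(self, doc: str) -> None:
        """Add a document in place; statistics update incrementally."""
        tokens = tokenize(doc)
        self.docs.append(doc)
        self.tfs.append(Counter(tokens))
        self.df.update(set(tokens))
        self.total_len += len(tokens)

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        n = len(self.docs)
        if n == 0:
            return []
        avgdl = self.total_len / n
        terms = tokenize(query)
        scored = []
        for i, tf in enumerate(self.tfs):
            dl = sum(tf.values())
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f == 0:
                    continue
                df = self.df[term]
                idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
                norm = f + self.k1 * (1 - self.b + self.b * dl / avgdl)
                score += idf * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, i))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.docs[i], s) for s, i in scored[:top_k]]


index = BM25Index([
    "tenant security deposit return deadline",
    "federal overtime rule for salaried employees",
    "copyright registration fee schedule",
])
hits = index.search("overtime rule")

index.add("new overtime threshold takes effect")  # no rebuild needed
later_hits = index.search("threshold")
```

A production retriever would need proper tokenization, stemming or stop-word handling, and persistence; the point here is only the scoring formula and the incremental update.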
Frameworks
- LangGraph
Is Agentic
true
Architectures
- Directed graph workflow with conditional routing from the Judge back to search
- node-based agents (Query, Search, Judge, Summary)
Collaboration
- Sequential agent hand-off with conditional routing
- Judge Agent enforces stopping-rule across iterations
Optimization Features
Token Efficiency
- Token-bounded content extraction (2.5k-char windows, hard caps)
- LoRA
System Optimization
- Dynamic local index updates without restart
- Deterministic, temperature=0 Judge Agent to reduce variance
Inference Optimization
- Snippet-anchored extraction to limit context window
- Basic vs. Enhanced search modes for latency/recall tradeoffs
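Snippet-anchored extraction can be illustrated with a few lines of string handling: locate the search snippet inside the scraped page, then keep only a bounded window around it. The function names, the prefix-fallback strategy, and the choice to center the window on the match are assumptions for this sketch; only the 2.5k-character window size comes from the summary above.

```python
def find_anchor(page_text: str, snippet: str) -> int:
    """Return the offset where the snippet (or a prefix of it) occurs, or -1."""
    # Offsets assume lowercasing keeps string length, which holds for ASCII.
    lowered = page_text.lower()
    words = snippet.lower().split()
    # Try the full snippet first, then shorter prefixes, because search
    # engines often truncate or lightly edit the text they return.
    for n in (len(words), 8, 4):
        probe = " ".join(words[:n])
        if probe:
            pos = lowered.find(probe)
            if pos != -1:
                return pos
    return -1


def extract_window(page_text: str, snippet: str, window: int = 2500) -> str:
    """Return at most `window` characters of page text around the snippet."""
    pos = find_anchor(page_text, snippet)
    if pos == -1:
        # No anchor found: fall back to the start of the page, still capped.
        return page_text[:window]
    start = max(0, pos - window // 2)
    return page_text[start:start + window]


page = "x" * 5000 + "The filing deadline was extended by the agency." + "y" * 5000
excerpt = extract_window(page, "the filing deadline was extended")
```

The slice acts as the hard cap: whatever the page size, the excerpt passed to the model never exceeds the window length.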
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- System accuracy depends on retrieval quality; missing authorities still cause errors.
- Multi-turn mode has high latency (≈56s) that limits real-time use.
- Evaluation restricted to US-centric LegalSearchQA (200 items); cross-jurisdiction generality untested.
- Judge Agent can be over-conservative and cause unnecessary search iterations.
When Not To Use
- When sub-second or low-latency responses are required.
- For non-US jurisdictions without integrated authoritative retrieval sources.
- If no reliable web or legal database access is available.
- When a compact, offline LLM without retrieval is mandated.
Failure Modes
- Missed authoritative documents cause hallucinations despite the Judge checks.
- Judge over-rejection increases latency and costs by triggering extra searches.
- Models default to heuristic priors when retrieval returns ambiguous or partial evidence.
- Narrow statutory exceptions are easily confused, leading to misclassification.
Core Entities
Models
- L-MARS
- GPT-4o
- Claude-4-Sonnet
- Gemini-2.5-Flash
- GPT-o3
Metrics
- Accuracy
- U-Score
- LLM-as-Judge
Datasets
- LegalSearchQA
- LegalBench
- LexGLUE
- Pile-of-Law
Benchmarks
- LegalSearchQA
Context Entities
Models
- OpenAI o1
- Qwen-QwQ
- DeepSeek-R1

