Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Automating headings and guided completion can speed legal drafting and document navigation; however, retrieval and placement are not reliable enough to omit expert review.
Summary TLDR
BRIEFME is a new dataset of U.S. Supreme Court briefs (2017–2024) built to test LLMs on three practical drafting tasks: argument summarization (writing short section headings), argument completion (filling in or suggesting missing headings in a table of contents), and case retrieval (finding the cited precedent). Strong commercial LLMs such as GPT-4o already outscore human-written headings on summarization and guided completion according to judge ratings (≈4.3 vs. 3.4), but automated retrieval and realistic completion placement remain weak (top-5 retrieval ≈31%, correct heading placement ≈18%). The dataset, judge, and baselines are provided to accelerate legal-drafting tools, with the caveat that human review is still required.
Problem Statement
Legal NLP has focused on judicial opinions. Drafting and structuring attorney briefs (writing persuasive section headings, completing missing arguments, and finding supporting cases) is underexplored. A benchmark and baselines are needed to measure model progress on these concrete drafting tasks.
Main Contribution
BRIEFME: a dataset of SCOTUS merits briefs (2017–Mar 2024) with structured sections and annotations
Three practical tasks: argument summarization, argument completion (guided and realistic), and case retrieval
An LLM-as-a-judge evaluation pipeline (o3-mini) used to filter low-quality human examples and to score outputs
Comprehensive benchmarks across many LLMs and retrieval methods; analysis of generalization and errors
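The judge-based filtering step listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `judge` callable, the `min_rating` cutoff of 3, the dictionary keys, and the `toy_judge` stand-in are all hypothetical. A real setup would replace `toy_judge` with a call to an LLM using the paper's judge prompt.

```python
from typing import Callable, Iterable


def filter_by_judge(
    examples: Iterable[dict],
    judge: Callable[[str, str], int],
    min_rating: int = 3,  # hypothetical cutoff, not taken from the paper
) -> list[dict]:
    """Keep examples whose heading the judge rates at or above min_rating (1-5 scale)."""
    kept = []
    for ex in examples:
        rating = judge(ex["section_text"], ex["heading"])
        if not 1 <= rating <= 5:
            raise ValueError(f"judge returned out-of-range rating: {rating}")
        if rating >= min_rating:
            # Record the rating so later review can sort or audit by it.
            kept.append({**ex, "judge_rating": rating})
    return kept


def toy_judge(section_text: str, heading: str) -> int:
    """Stand-in judge for illustration only; it rewards headings of four or more words."""
    return 5 if len(heading.split()) >= 4 else 2
```

Passing the judge in as a callable keeps the filter independent of any particular model or API, so the same function can screen human-written and model-written headings.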
Key Findings
Large LLMs already produce high-quality brief headings for summarization and guided completion
Realistic argument completion (detect + place + generate) is still hard
Case retrieval for brief citations performs poorly with off-the-shelf methods
An LLM judge (o3-mini) provided more consistent meta-ratings than recruited human annotators
Performance generalizes to held-out briefs published after model cutoffs
Results
BRIEFME size (summarization/completion/retrieval examples)
Argument summarization quality (LLM judge average)
Guided argument completion quality (LLM judge average)
Realistic argument completion heading-placement accuracy
Case retrieval top-5 recall
LLM-as-judge meta-rating vs human annotators
Who Should Care
What To Try In 7 Days
Pilot GPT-4o few-shot prompts to auto-generate section headings and measure saved edit time
Use the paper's LLM-as-judge prompt to filter low-quality human or model headings before review
Run a hybrid retrieval pipeline: a BM25 first pass, then a fine-tuned ColBERT reranker, followed by manual verification of the returned cases
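The retrieve-then-rerank idea in the last item can be sketched as below. This is a minimal illustration under stated assumptions: BM25 is written out in plain Python with whitespace tokenization, and the `reranker` argument is a hypothetical callable standing in for a fine-tuned ColBERT model, which is not included here. The function names and the `first_pass_k` and `final_k` defaults are illustrative choices, not values from the paper.

```python
import math
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    # Simplifying assumption: lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    reranker: Callable[[str, str], float],  # hypothetical stand-in for a neural reranker
    first_pass_k: int = 100,
    final_k: int = 5,
) -> list[int]:
    """Return indices of the final_k documents after a BM25 pass and reranking."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:first_pass_k]
    reranked = sorted(candidates, key=lambda i: reranker(query, docs[i]), reverse=True)
    return reranked[:final_k]


def overlap_reranker(query: str, doc: str) -> float:
    """Toy reranker for illustration: counts query terms shared with the document."""
    return float(len(set(tokenize(query)) & set(tokenize(doc))))
```

The returned indices are candidates for a lawyer to verify, not authoritative citations; the manual verification step stays outside the code.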
Optimization Features
Training Optimization
- LoRA
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Data limited to English and U.S. Supreme Court briefs (2017–Mar 2024); not representative of other jurisdictions or lower-court practice
- Evaluation relies heavily on an LLM judge; the judge was validated by meta-review, but expert human ratings remain variable
- Case retrieval corpus depends on correctly detected citations; citations that eyecite fails to detect are excluded from measurement
When Not To Use
- Do not use model outputs as final legal filings without lawyer review
- Do not rely on BRIEFME-trained retrievers for confidential or non-U.S. legal work
- Do not assume automatic citation retrieval is exhaustive or authoritative
Failure Modes
- Hallucinated or incorrect case citations
- Headings that are persuasive-sounding but legally imprecise
- Misplaced headings in table-of-contents structure
- Relying on memorized text rather than reasoning in edge cases
Core Entities
Models
- GPT-4o
- Llama-3.1-70B
- Qwen-2.5-32b
- Mistral-7b
- Gemma-2-9b
- ColBERT
- DPR
- BM25
- SAILER
- CaseEncoder
Metrics
- o3-mini judge rating (1–5)
- Recall@k
- MRR@10
- nDCG@10
- SummaC
- BLEU
- ROUGE
- BERTScore
- LegalBERT score
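For the ranking metrics in this list, a minimal sketch of the per-query computations is given below, assuming binary relevance (a retrieved case is either cited by the brief or not) and a ranked list without duplicate IDs. The function names are illustrative; MRR@10 is the mean of the per-query reciprocal rank over all queries, and the paper's exact evaluation code may differ.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """1 / rank of the first relevant item within the top k, or 0 if none appears."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: two relevant cases, found at ranks 2 and 4.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(recall_at_k(ranked, relevant, k=2))      # 0.5
print(reciprocal_rank_at_k(ranked, relevant))  # 0.5
print(round(ndcg_at_k(ranked, relevant), 3))   # 0.651
```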
Datasets
- BRIEFME
- SCOTUS briefs (2017–2024)
- retrieval corpus (cited cases from CourtListener)
Benchmarks
- BRIEFME argument summarization
- BRIEFME argument completion
- BRIEFME case retrieval
Context Entities
Models
- Gemma
- Qwen
- Mistral
- Llama 3.1 family
Datasets
- LePaRD
- CLERC
- CaseSumm
- Multi-LexSum

