BRIEFME: a SCOTUS-briefs benchmark testing summarization, completion, and case retrieval

June 7, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and evaluation are well scoped and tested; summarization and guided completion are near production-ready for assistive use, but retrieval and realistic completion placement require more engineering and human review.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino

Why It Matters For Business

Automating headings and guided completion can speed legal drafting and document navigation; however, retrieval and placement are not reliable enough to omit expert review.

Summary TLDR

BRIEFME is a new dataset of U.S. Supreme Court briefs (2017–2024) built to test LLMs on three practical drafting tasks: argument summarization (make short section headings), argument completion (fill or suggest missing headings in a table of contents), and case retrieval (find the cited precedent). Strong commercial LLMs (GPT-4o) already beat human headings on summarization and guided completion by judge ratings (≈4.3 vs 3.4), but automated retrieval and realistic completion placement remain weak (top-5 retrieval ≈31%, correct heading placement ≈18%). The dataset, judge, and baselines are provided to accelerate legal-drafting tools while stressing that human review is required.

Problem Statement

Legal NLP has focused on judicial opinions. Drafting and structuring attorney briefs (writing persuasive section headings, completing missing arguments, and finding supporting cases) is underexplored. A benchmark and baselines are needed to measure model progress on these concrete drafting tasks.

Main Contribution

BRIEFME: a dataset of SCOTUS merits briefs (2017–Mar 2024) with structured sections and annotations

Three practical tasks: argument summarization, argument completion (guided and realistic), and case retrieval

Key Findings

Large LLMs already produce high-quality brief headings for summarization and guided completion

Numbers: GPT-4o judge rating 4.3/5 vs human headings ~3.4/5 (summarization)

Practical Use: Use few-shot GPT-4o to draft or refine section headings; outputs typically need only minor edits.

Evidence Ref: Paper §4.2; Table 3
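
As a rough illustration of the few-shot setup mentioned under Practical Use, the sketch below assembles a prompt from example (section text, heading) pairs. The function name `build_few_shot_prompt` and the instruction wording are hypothetical and are not the prompt used in the paper; sending the prompt to a model is left out on purpose.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], section_text: str) -> str:
    """Assemble a few-shot prompt for heading generation.

    `examples` holds (section body, heading) pairs. The instruction text is an
    illustrative assumption, not the paper's prompt.
    """
    parts = ["Write a concise, persuasive heading for the brief section below."]
    for body, heading in examples:
        # Each demonstration shows a section followed by its heading.
        parts.append(f"Section:\n{body}\nHeading: {heading}")
    # The target section ends with an open "Heading:" slot for the model to fill.
    parts.append(f"Section:\n{section_text}\nHeading:")
    return "\n\n".join(parts)
```

The returned string can be passed to whichever chat or completion client is in use; the generated heading should still go through attorney review.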

Realistic argument completion (detect + place + generate) is still hard

Numbers: Model correctly places missing heading only 18% of the time; heading-level accuracy 86%

Practical Use: Use LLMs to propose headings or detect missing structure, but do not trust automatic placement without human validation.

Evidence Ref: Paper §4.3; Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BRIEFME size (summarization / completion / retrieval examples) | 23332 / 5905 / 91086 | - | - | Table 1 (paper) | Train/Test/Dev counts in Table 1 | Table 1
Argument summarization quality (LLM judge average) | GPT-4o ≈ 4.3 / 5 (few-shot) | Human headings ≈ 3.4 / 5 (unfiltered) | +0.9 judge points vs human | BRIEFME test | §4.2 and Table 3 | Table 3

What To Try In 7 Days

Pilot GPT-4o few-shot prompts to auto-generate section headings and measure saved edit time

Use the paper's LLM-as-judge prompt to filter low-quality human or model headings before review

Run a hybrid retrieval pipeline: a BM25 first pass, then a fine-tuned ColBERT reranker, then manual verification of the top results
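
The hybrid retrieval idea in the last item can be sketched as a two-stage pipeline. This is a minimal sketch under stated assumptions: the BM25 scorer is a small pure-Python Okapi BM25 variant, and `overlap_reranker` is a hypothetical stand-in for a fine-tuned ColBERT model, which is not loaded here. The names `BM25`, `retrieve`, and `overlap_reranker` are illustrative and do not come from the paper's code.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()


class BM25:
    """Small Okapi BM25 index over an in-memory {doc_id: text} corpus."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {d: Counter(tokenize(t)) for d, t in docs.items()}
        self.doc_len = {d: sum(c.values()) for d, c in self.tf.items()}
        self.avg_len = sum(self.doc_len.values()) / max(len(docs), 1)
        df = Counter(term for c in self.tf.values() for term in c)
        n = len(docs)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, doc_id: str) -> float:
        tf, length = self.tf[doc_id], self.doc_len[doc_id]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[Tuple[str, float]]:
        scored = [(d, self.score(query, d)) for d in self.tf]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def overlap_reranker(query: str, passage: str) -> float:
    # Hypothetical placeholder for a neural reranker such as fine-tuned ColBERT.
    q, p = set(tokenize(query)), set(tokenize(passage))
    return len(q & p) / max(len(q), 1)


def retrieve(query: str, index: BM25, docs: Dict[str, str],
             reranker: Callable[[str, str], float],
             first_pass_k: int = 100, final_k: int = 5) -> List[str]:
    """Stage 1: BM25 candidates. Stage 2: rerank candidates, keep final_k."""
    candidates = [d for d, _ in index.top_k(query, first_pass_k)]
    reranked = sorted(candidates, key=lambda d: reranker(query, docs[d]), reverse=True)
    return reranked[:final_k]
```

In use, the list returned by `retrieve` would be handed to a reviewer rather than inserted into a draft directly, in line with the reported top-5 retrieval rate of about 31%.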

Optimization Features

Training Optimization
LoRA
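
For readers unfamiliar with the technique listed above, LoRA in its standard form freezes a pretrained weight matrix and trains only a low-rank update. The notation below is the generic formulation, not a configuration taken from the paper; this summary does not state the rank or scaling used for the BRIEFME baselines.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here $W_0 \in \mathbb{R}^{d \times k}$ stays frozen, so the trainable parameter count per adapted matrix drops from $dk$ to $r(d + k)$.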

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Data limited to English and U.S. Supreme Court briefs (2017–Mar 2024); not representative of other jurisdictions or lower-court practice

Evaluation relies heavily on an LLM judge; although validated by meta-review, expert human ratings remain variable

When Not To Use

Do not use model outputs as final legal filings without lawyer review

Do not rely on BRIEFME-trained retrievers for confidential or non-U.S. legal work

Failure Modes

Hallucinated or incorrect case citations

Headings that are persuasive-sounding but legally imprecise

Core Entities

Models

GPT-4o, Llama-3.1-70B, Qwen-2.5-32b, Mistral-7b, Gemma-2-9b, ColBERT, DPR, BM25, SAILER, CaseEncoder

Metrics

o3-mini judge rating (1–5), Recall@k, MRR@10, nDCG@10, SummaC, BLEU, ROUGE, BERTScore, LegalBERT score
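
The three retrieval metrics listed above have standard definitions, sketched below for a single query with binary relevance. This is an illustrative implementation, not the paper's evaluation code; it assumes each ranked list contains unique document IDs, and the function names are chosen here for clarity.

```python
from math import log2
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg
```

Benchmark-level scores are the mean of these per-query values over all test queries.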

Datasets

BRIEFME, SCOTUS briefs (2017-2024), retrieval corpus (cited cases from CourtListener)

Benchmarks

BRIEFME argument summarization, BRIEFME argument completion, BRIEFME case retrieval

Context Entities

Models

Gemma, Qwen, Mistral, Llama 3.1 family

Datasets

LePaRD, CLERC, CaseSumm, Multi-LexSum