BRIEFME: a SCOTUS-briefs benchmark testing summarization, completion, and case retrieval

Overview

Decision SnapshotReady For Pilot

The dataset and evaluation are well scoped and tested; summarization and guided completion are near production-ready for assistive use, but retrieval and realistic completion placement require more engineering and human review.

Citations0

Evidence Strength0.80

Confidence0.86

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino

Links

Abstract / PDF

Why It Matters For Business

Automating headings and guided completion can speed legal drafting and document navigation; however, retrieval and placement are not reliable enough to omit expert review.

Who Should Care

Product Manager ML Engineer Data Scientist

Summary TLDR

BRIEFME is a new dataset of U.S. Supreme Court briefs (2017–2024) built to test LLMs on three practical drafting tasks: argument summarization (make short section headings), argument completion (fill or suggest missing headings in a table of contents), and case retrieval (find the cited precedent). Strong commercial LLMs (GPT-4o) already beat human headings on summarization and guided completion by judge ratings (≈4.3 vs 3.4), but automated retrieval and realistic completion placement remain weak (top-5 retrieval ≈31%, correct heading placement ≈18%). The dataset, judge, and baselines are provided to accelerate legal-drafting tools while stressing that human review is required.

Problem Statement

Legal NLP has focused on judicial opinions. Drafting and structuring attorney briefs — writing persuasive section headings, completing missing arguments, and finding supporting cases — is underexplored. We need a benchmark and baselines to measure model progress on these concrete drafting tasks.

Main Contribution

BRIEFME: a dataset of SCOTUS merits briefs (2017–Mar 2024) with structured sections and annotations

Three practical tasks: argument summarization, argument completion (guided and realistic), and case retrieval

Key Findings

Large LLMs already produce high-quality brief headings for summarization and guided completion

NumbersGPT-4o judge rating 4.3/5 vs human headings ~3.4/5 (summarization)

Practical UseUse few-shot GPT-4o to draft or refine section headings; outputs typically need only minor edits.

Evidence RefPaper §4.2; Table 3

Realistic argument completion (detect + place + generate) is still hard

NumbersModel correctly places missing heading only 18% of the time; heading-level accuracy 86%

Practical UseUse LLMs to propose headings or detect missing structure, but do not trust automatic placement without human validation.

Evidence RefPaper §4.3; Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BRIEFME size (summarization/completion/retrieval examples)	23332 / 5905 / 91086	—	—	Table 1 (paper)	Train/Test/Dev counts in Table 1	Table 1
Argument summarization quality (LLM judge average)	GPT-4o ≈ 4.3 / 5 (few-shot)	Human headings ≈ 3.4 / 5 (unfiltered)	+0.9 judge points vs human	BRIEFME test	§4.2 and Table 3	Table 3

What To Try In 7 Days

Pilot GPT-4o few-shot prompts to auto-generate section headings and measure saved edit time

Use the paper's LLM-as-judge prompt to filter low-quality human or model headings before review

Run a hybrid retrieval pipeline: BM25 initial pass + ColBERT fine-tuned reranker and manual verification

Optimization Features

Training Optimization

LoRA

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Risks & Boundaries

Limitations

Data limited to English and U.S. Supreme Court briefs (2017–Mar 2024); not representative of other jurisdictions or lower-court practice

Evaluation relies heavily on an LLM judge; although validated by meta-review, expert human ratings remain variable

When Not To Use

Do not use model outputs as final legal filings without lawyer review

Do not rely on BRIEFME-trained retrievers for confidential or non-U.S. legal work

Failure Modes

Hallucinated or incorrect case citations

Headings that are persuasive-sounding but legally imprecise

Core Entities

Models

GPT-4oLlama-3.1-70BQwen-2.5-32bMistral-7bGemma-2-9bColBERTDPRBM25SAILERCaseEncoder

Metrics

o3-mini judge rating (1–5)Recall@kMRR@10nDCG@10SummaCBLEUROUGEBERTScoreLegalBERT score

Datasets

BRIEFMESCOTUS briefs (2017-2024)retrieval corpus (cited cases from courtlistener)

Benchmarks

BRIEFME argument summarizationBRIEFME argument completionBRIEFME case retrieval

Context Entities

Models

GemmaQwenMistralLlama 3.1 family

Datasets

LePaRDCLERCCaseSummMulti-LexSum

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Large LLMs already produce high-quality brief headings for summarization and guided completion

Realistic argument completion (detect + place + generate) is still hard

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Datasets

You May Also Want to Read

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding

MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

Key finding