BRIEFME: a SCOTUS-briefs benchmark testing summarization, completion, and case retrieval

June 7, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

0

Authors

Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino

Why It Matters For Business

Automating headings and guided completion can speed legal drafting and document navigation; however, retrieval and placement are not reliable enough to omit expert review.

Summary TLDR

BRIEFME is a new dataset of U.S. Supreme Court briefs (2017–2024) built to test LLMs on three practical drafting tasks: argument summarization (make short section headings), argument completion (fill or suggest missing headings in a table of contents), and case retrieval (find the cited precedent). Strong commercial LLMs (GPT-4o) already beat human headings on summarization and guided completion by judge ratings (≈4.3 vs 3.4), but automated retrieval and realistic completion placement remain weak (top-5 retrieval ≈31%, correct heading placement ≈18%). The dataset, judge, and baselines are provided to accelerate legal-drafting tools while stressing that human review is required.

Problem Statement

Legal NLP has focused on judicial opinions. Drafting and structuring attorney briefs — writing persuasive section headings, completing missing arguments, and finding supporting cases — is underexplored. We need a benchmark and baselines to measure model progress on these concrete drafting tasks.

Main Contribution

BRIEFME: a dataset of SCOTUS merits briefs (2017–Mar 2024) with structured sections and annotations

Three practical tasks: argument summarization, argument completion (guided and realistic), and case retrieval

An LLM-as-a-judge evaluation pipeline (o3-mini) used to filter low-quality human examples and to score outputs

Comprehensive benchmarks across many LLMs and retrieval methods; analysis of generalization and errors
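The judge-based filtering step can be illustrated with a minimal sketch. This is not the paper's pipeline: the `judge` callable, the `min_rating` threshold, and the `section_text` / `heading` field names are all hypothetical, and `toy_judge` is a word-overlap stand-in for a real o3-mini call so the example runs offline.

```python
from typing import Callable, Iterable


def filter_by_judge(
    examples: Iterable[dict],
    judge: Callable[[str, str], int],
    min_rating: int = 3,  # assumed threshold, not taken from the paper
) -> list[dict]:
    """Keep examples whose heading the judge rates at or above min_rating."""
    kept = []
    for ex in examples:
        rating = judge(ex["section_text"], ex["heading"])
        if not 1 <= rating <= 5:
            raise ValueError(f"judge returned out-of-range rating: {rating}")
        if rating >= min_rating:
            # Store the rating so reviewers can sort or audit later.
            kept.append({**ex, "judge_rating": rating})
    return kept


def toy_judge(section_text: str, heading: str) -> int:
    """Stand-in judge: 5 if at least half the heading words occur in the section, else 1."""
    section_words = set(section_text.lower().split())
    heading_words = heading.lower().split()
    if not heading_words:
        return 1
    overlap = sum(w in section_words for w in heading_words) / len(heading_words)
    return 5 if overlap >= 0.5 else 1
```

In a real setup, `toy_judge` would be replaced by a function that sends the section and heading to the judge model with a rating prompt and parses the 1–5 score from the response.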

Key Findings

Large LLMs already produce high-quality brief headings for summarization and guided completion

Numbers: GPT-4o judge rating 4.3/5 vs human headings ~3.4/5 (summarization)

Realistic argument completion (detect + place + generate) is still hard

Numbers: Model correctly places missing heading only 18% of the time; heading-level accuracy 86%

Case retrieval for brief citations performs poorly with off-the-shelf methods

Numbers: Best retrieval (ColBERT SFT+rerank) R@5 ≈ 31.4%; top-5 correct ≈ 31.5%

An LLM judge (o3-mini) provided more consistent meta-ratings than recruited human annotators

Numbers: Judge meta-rating averaged 4.6 (summarization) and 4.9 (completion) vs human meta-ratings range 2.4–4.3

Performance generalizes to held-out briefs published after model cutoffs

Numbers: Held-out summarization drop only ≈0.7% in LLM-judge score

Results

BRIEFME size (summarization/completion/retrieval examples)

Value: 23,332 / 5,905 / 91,086

Argument summarization quality (LLM judge average)

Value: GPT-4o ≈ 4.3 / 5 (few-shot)

Baseline: Human headings ≈ 3.4 / 5 (unfiltered)

Guided argument completion quality (LLM judge average)

Value: GPT-4o ≈ 4.3 / 5

Baseline: Human headings ≈ 3.5 / 5

Realistic argument completion placement accuracy

Value: 18% correct placement (avg)

Case retrieval top-5 recall

Value: ColBERT (SFT+rerank) R@5 = 31.4%

Baseline: BM25 R@5 = 19.6% (zero-shot)

LLM-as-judge meta-rating vs human annotators

Value: Judge meta-rating avg: 4.6 (summarization), 4.9 (completion)

Baseline: Human meta-rating range 2.4–4.3

What To Try In 7 Days

Pilot GPT-4o few-shot prompts to auto-generate section headings and measure saved edit time

Use the paper's LLM-as-judge prompt to filter low-quality human or model headings before review

Run a hybrid retrieval pipeline: a BM25 first pass, a fine-tuned ColBERT reranker, and manual verification of the results
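The hybrid retrieval idea can be sketched in plain Python as a two-stage pipeline. This is an illustrative outline under stated assumptions, not the paper's implementation: the BM25 class is a small from-scratch Okapi BM25 with whitespace tokenization, `rerank_score` is a hypothetical callable standing in for a fine-tuned ColBERT scorer, and the `needs_human_review` flag is simply a reminder that every hit still goes to a lawyer.

```python
import math
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    # Simplistic whitespace tokenizer; an assumption for this sketch.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in tokens]
        self.doc_len = [len(t) for t in tokens]
        self.avg_len = (sum(self.doc_len) / len(docs)) if docs else 1.0
        self.avg_len = self.avg_len or 1.0  # guard against all-empty docs
        df = Counter()
        for t in tokens:
            df.update(set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query: str, i: int) -> float:
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int) -> list[int]:
        order = sorted(
            range(len(self.tf)), key=lambda i: self.score(query, i), reverse=True
        )
        return order[:k]


def retrieve(
    query: str,
    docs: list[str],
    rerank_score: Callable[[str, str], float],  # hypothetical reranker hook
    first_pass_k: int = 100,
    final_k: int = 5,
) -> list[dict]:
    """BM25 narrows the corpus, the reranker reorders, humans verify."""
    candidates = BM25(docs).top_k(query, first_pass_k)
    reranked = sorted(
        candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True
    )
    return [{"doc_id": i, "needs_human_review": True} for i in reranked[:final_k]]
```

For a large corpus the BM25 index would be built once and reused across queries rather than rebuilt inside `retrieve`; it is rebuilt here only to keep the sketch short.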

Optimization Features

Training Optimization

  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Data limited to English and U.S. Supreme Court briefs (2017–Mar 2024); not representative of other jurisdictions or lower-court practice
  • Evaluation relies heavily on an LLM judge; although validated by meta-review, expert human ratings remain variable
  • Case retrieval corpus depends on correctly detected citations; citations that eyecite fails to detect are excluded from measurement

When Not To Use

  • Do not use model outputs as final legal filings without lawyer review
  • Do not rely on BRIEFME-trained retrievers for confidential or non-U.S. legal work
  • Do not assume automatic citation retrieval is exhaustive or authoritative

Failure Modes

  • Hallucinated or incorrect case citations
  • Headings that are persuasive-sounding but legally imprecise
  • Misplaced headings in table-of-contents structure
  • Relying on memorized text rather than reasoning in edge cases

Core Entities

Models

  • GPT-4o
  • Llama-3.1-70B
  • Qwen-2.5-32B
  • Mistral-7B
  • Gemma-2-9B
  • ColBERT
  • DPR
  • BM25
  • SAILER
  • CaseEncoder

Metrics

  • o3-mini judge rating (1–5)
  • Recall@k
  • MRR@10
  • nDCG@10
  • SummaC
  • BLEU
  • ROUGE
  • BERTScore
  • LegalBERT score
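For readers unfamiliar with the ranking metrics above, here is a minimal sketch of Recall@k, MRR@10, and nDCG@10 with binary relevance. The function names are illustrative, and the paper's exact definitions may differ in detail (for example, in how queries with several gold cases are handled). The sketch assumes `ranked` contains no duplicate IDs.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

When each query has exactly one gold case, Recall@k reduces to a hit rate: it is 1 if the gold case is in the top k and 0 otherwise. Benchmark-level scores are the mean of these per-query values.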

Datasets

  • BRIEFME
  • SCOTUS briefs (2017–2024)
  • Retrieval corpus (cited cases from CourtListener)

Benchmarks

  • BRIEFME argument summarization
  • BRIEFME argument completion
  • BRIEFME case retrieval

Context Entities

Models

  • Gemma
  • Qwen
  • Mistral
  • Llama 3.1 family

Datasets

  • LePaRD
  • CLERC
  • CaseSumm
  • Multi-LexSum