Membership inference mostly fails on pretrained LLMs; apparent successes often come from dataset shifts

February 12, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

10

Authors

Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, Hannaneh Hajishirzi

Links

Abstract / PDF

Why It Matters For Business

Most standard membership inference tests will not show large privacy leakage for models pretrained at scale. Careless benchmark choices (e.g., temporally shifted non-members) can, however, falsely signal leakage.

Summary TLDR

The authors run a large-scale, reproducible evaluation of five membership inference attacks (MIAs) against language models trained on the Pile (PYTHIA suite, GPT‑NEO, SILO, etc.). Across most domains and model sizes (up to 12B), MIAs perform near random (AUC ≈ 0.5–0.6). Two main causes explain this: large pretraining corpora with near-one-epoch training reduce memorization, and high lexical overlap between training and candidate non-members makes membership ambiguous. When MIAs succeed, it is often because the non-member set is unintentionally shifted (e.g., temporally newer data) and thus easier to separate. The paper releases MIMIR, a unified benchmark and code for future audits.

Problem Statement

It is not clear whether standard membership inference attacks can detect which texts were in an LLM's pretraining corpus. Prior work shows that MIAs work on classifiers and fine-tuned LMs, but it is not known whether those results carry over to large-scale pretraining. This paper evaluates existing MIAs at scale and diagnoses why they often fail and why they sometimes appear to succeed.

Main Contribution

Large-scale evaluation of five black-box MIAs across LLM families (PYTHIA, GPT-NEO, SILO) and Pile domains.

Finding: MIAs are near-random in most domains (AUC < 0.6) but can be high when non-members are distributionally shifted.

Diagnosis: two confounders, namely massive pretraining with ~1 epoch (which reduces memorization) and high n-gram overlap between members and non-members (which makes membership fuzzy).

Benchmarks and code release: MIMIR package and HuggingFace dataset for reproducible MIA evaluation.

Key Findings

Existing MIAs mostly fail against pre-trained LLMs.

Numbers: Most AUC ROC < 0.6 across domains (Table 1).

Training scale and schedule reduce the effectiveness of MIAs.

Numbers: MIA performance spikes early in training, then decreases across checkpoints; multi-epoch training raises AUC (Figure 2).

High lexical overlap between members and non-members makes membership ambiguous.

Numbers: Average 7‑gram overlap: Wikipedia 32.5%, ArXiv 39.3%, PubMed 41.0%, GitHub 76.9% (Figure 3).
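To make the overlap figures concrete, here is a minimal sketch of one plausible way to measure n-gram overlap: the fraction of a candidate's n-grams that also occur somewhere in the training corpus. Whitespace tokenization, the in-memory set index, and the names `ngrams`, `build_index`, and `overlap_fraction` are illustrative assumptions, not the paper's implementation.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]

def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the (hypothetical) training corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        # Assumption: whitespace tokenization; a real audit would use the model tokenizer.
        index.update(ngrams(doc.split(), n))
    return index

def overlap_fraction(candidate: str, index: Set[NGram], n: int) -> float:
    """Fraction of the candidate's n-grams that also occur in the corpus index."""
    grams = ngrams(candidate.split(), n)
    if not grams:
        return 0.0  # candidate shorter than n tokens
    return sum(g in index for g in grams) / len(grams)

# Toy example with 3-grams (the summary reports 7-gram overlap).
train_docs = ["the quick brown fox jumps over the lazy dog"]
index = build_index(train_docs, n=3)
score = overlap_fraction("the quick brown fox sleeps", index, n=3)
print(f"3-gram overlap: {score:.3f}")
```

A candidate non-member with very high overlap is hard to call a true non-member, which is the ambiguity this finding describes. For a corpus the size of the Pile, an in-memory set would not scale, so a different index structure would be needed.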

Selecting temporally shifted non-members inflates MIA performance.

Numbers: Reference-based AUC up to 0.796 on PYTHIA-DEDUP-12B with temporally shifted Wikipedia (Table 3).
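For readers unfamiliar with reference-based attacks, the sketch below shows a common form of the idea: calibrate the target model's loss on a text by the loss a separate reference model assigns to the same text. The exact formulation used in the paper may differ. The per-token log-probabilities are made-up inputs, and `mean_nll` and `reference_score` are hypothetical helper names.

```python
from typing import Sequence

def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood from per-token log-probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)

def reference_score(target_logprobs: Sequence[float],
                    reference_logprobs: Sequence[float]) -> float:
    """Reference-calibrated membership score.

    Higher means the target model finds the text easier than the reference
    model does, which the attack reads as evidence of membership.
    """
    return mean_nll(reference_logprobs) - mean_nll(target_logprobs)

# Made-up log-probabilities for a three-token text.
target_lp = [-1.0, -0.5, -1.5]   # target model: mean NLL = 1.0
ref_lp = [-2.0, -1.0, -3.0]      # reference model: mean NLL = 2.0
score = reference_score(target_lp, ref_lp)
print(f"reference-calibrated score: {score:.2f}")
```

This also suggests why temporal shift can inflate results: if non-members are newer text that is harder for the target model for reasons unrelated to training membership, the score gap grows without any memorization being involved.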

Small lexical or semantic edits convert 'members' into confidently classified non-members.

Numbers: Modified members at small edit distances yield FPR near 0% at thresholds with 1–10% nominal FPR (Table 4 & Table 10).

Results

Typical MIA AUC ROC (across domains)

Value: < 0.6 (near-random) on most domains

Baseline: Random = 0.5
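As a reminder of what these AUC numbers mean, the sketch below computes AUC ROC directly from attack scores as the probability that a randomly chosen member scores higher than a randomly chosen non-member, with ties counted as half. The scores and the function name `auc_roc` are illustrative assumptions. The pairwise loop is quadratic, so it is only meant for small audits.

```python
from typing import Sequence

def auc_roc(member_scores: Sequence[float],
            nonmember_scores: Sequence[float]) -> float:
    """AUC ROC via pairwise comparison (ties count as 0.5).

    Convention assumed here: a higher score means 'more likely a member'.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for u in nonmember_scores:
            if m > u:
                wins += 1.0
            elif m == u:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Made-up attack scores for three members and three non-members.
members = [0.9, 0.4, 0.6]
nonmembers = [0.5, 0.3, 0.7]
auc = auc_roc(members, nonmembers)
print(f"AUC ROC: {auc:.3f}")
```

An attack whose scores are distributed identically for members and non-members lands at 0.5, which is the random baseline quoted above.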

GitHub domain AUC

Value: ≈ 0.67–0.74 (best-performing domain)

Baseline: Other domains ~0.5

AUC under temporal shift

Value: up to 0.796 (reference-based, 12B)

Baseline: ≈ 0.52 on non-shifted Wikipedia

7‑gram overlap (average)

Value: Wikipedia 32.5%, ArXiv 39.3%, PubMed 41.0%, GitHub 76.9%

Baseline: n-gram overlap varies by domain

Who Should Care

What To Try In 7 Days

Run the MIMIR package against your model and domains to reproduce results quickly.

Compute n‑gram overlap between candidate non-members and your training corpus to check representativeness.

If auditing leakage, test on both single-epoch pretraining and any fine-tuning checkpoints you use in production.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation focuses on black-box MIAs; white-box or stronger extraction attacks may behave differently.
  • Target models go up to 12B parameters and cover only a subset of available LLM families; trends may differ at larger scales.
  • High-overlap domains (GitHub) and decontamination choices can produce outlier results.
  • Prior work's pipelines and exact non-member choices could not be fully reproduced, which limits direct comparisons.

When Not To Use

  • When auditing fine-tuned models or multi-epoch training: expect higher leakage than reported here.
  • For white-box threat models or strong extraction attacks: other methods may find more leakage.
  • To conclude that there is no privacy risk because AUC ≈ 0.5: extractability and PII tests need separate analysis.

Failure Modes

  • Temporal or topical distribution shifts between members and non-members cause inflated MIA scores (false positives about 'memorization').
  • High n‑gram overlap obscures real leakage and causes near-chance MIA performance.
  • Small edits or paraphrases of true members are often classified as non-members, causing false negatives relative to practical leakage.

Core Entities

Models

  • PYTHIA
  • PYTHIA-DEDUP
  • GPT-NEO
  • SILO
  • DATABLATIONS
  • OLMO
  • STABLELM-BASE-ALPHA-3B-V2

Metrics

  • AUC ROC
  • TPR at low FPR (e.g., 1% FPR)
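
The second metric can be sketched as follows: choose the score threshold from the non-member scores so that the empirical FPR does not exceed the target, then report the fraction of members above that threshold. The scores are made up, and `tpr_at_fpr` is a hypothetical helper rather than a function from the MIMIR package.

```python
from typing import Sequence, Tuple

def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               target_fpr: float) -> Tuple[float, float, float]:
    """Return (TPR, achieved FPR, threshold) for a target FPR.

    A sample is predicted 'member' when its score is strictly above the
    threshold. The achieved FPR never exceeds target_fpr.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of false positives we can afford (rounded down, so conservative).
    allowed = int(target_fpr * len(ranked))
    # Scores strictly above the (allowed+1)-th highest non-member are flagged.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    tpr = sum(s > threshold for s in member_scores) / len(member_scores)
    fpr = sum(s > threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr, fpr, threshold

# Made-up scores: ten non-members, four members, target FPR of 10%.
nonmembers = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
members = [0.95, 0.5, 1.2, 0.85]
tpr, fpr, threshold = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
print(f"threshold={threshold}, TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Reporting TPR at a low FPR complements AUC because it focuses on the regime where an attacker can confidently flag a few members while making very few false accusations.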

Datasets

  • The Pile
  • Pile-CC
  • Wikipedia
  • ArXiv
  • PubMed Central
  • GitHub
  • DM Mathematics
  • HackerNews
  • C4
  • RealTimeData WikiText

Benchmarks

  • MIMIR (membership inference benchmark package)
  • Temporal non-member benchmarks
  • n-gram overlap thresholding benchmarks

Context Entities

Models

  • GPT-2
  • OPT
  • DISTILGPT2
  • LLAMA
  • StableLM