Membership inference mostly fails on pretrained LLMs; apparent successes often come from dataset shifts

February 12, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

Thorough, reproducible experiments support the conclusions, but the results are limited to black-box MIAs and to the evaluated model families (up to 12B parameters).

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, Hannaneh Hajishirzi

Why It Matters For Business

Most standard membership inference tests will not reveal large privacy leakage in models pretrained at scale, but careless benchmark choices (e.g., temporally shifted non-members) can falsely signal leakage.

Summary TLDR

The authors run a large-scale, reproducible evaluation of five membership inference attacks (MIAs) against language models trained on the Pile (PYTHIA suite, GPT‑NEO, SILO, etc.). Across most domains and model sizes (up to 12B), MIAs perform near random (AUC ≈ 0.5–0.6). Two main causes explain this: large pretraining corpora with near-one-epoch training reduce memorization, and high lexical overlap between training and candidate non-members makes membership ambiguous. When MIAs succeed, it is often because the non-member set is unintentionally shifted (e.g., temporally newer data) and thus easier to separate. The paper releases MIMIR, a unified benchmark and code for future audits.

Problem Statement

We lack a clear answer to whether standard membership inference attacks can detect which texts were in an LLM's pretraining corpus. Prior work shows that MIAs work on classifiers and fine-tuned LMs, but it is unclear whether those results carry over to large-scale pretraining. This paper evaluates existing MIAs at scale and diagnoses why they often fail and why they occasionally succeed.

Main Contribution

Large-scale evaluation of five black-box MIAs across LLM families (PYTHIA, GPT-NEO, SILO) and Pile domains.

Finding: MIA performance is near-random in most domains (AUC < 0.6), but AUC can be high when non-members are distributionally shifted.
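To make "black-box MIA" concrete, here is a minimal sketch of a loss-based membership score. It assumes the per-token log-probabilities of a candidate text have already been obtained from the target model. The function names and the threshold are hypothetical and are not taken from MIMIR.

```python
def loss_score(token_logprobs):
    """Average negative log-likelihood per token (lower = more 'member-like')."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def predict_member(token_logprobs, threshold):
    """Flag a text as a member when its loss falls below an assumed threshold."""
    return loss_score(token_logprobs) < threshold


# Illustrative values only: log-probabilities a model might assign to tokens.
low_loss_text = [-0.2, -0.1, -0.3]
high_loss_text = [-2.0, -1.5, -2.5]
print(predict_member(low_loss_text, threshold=1.0))
print(predict_member(high_loss_text, threshold=1.0))
```

In practice the threshold is not fixed by hand; attacks are compared by sweeping it and reporting AUC ROC, which is why near-random AUC means no threshold separates members from non-members well.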

Key Findings

Existing MIAs mostly fail against pre-trained LLMs.

Numbers: Most AUC ROC < 0.6 across domains (Table 1).

Practical Use: Don't assume a high MIA score means broad memorization of pretraining data; run domain tests before drawing privacy conclusions.

Evidence Ref: Table 1, Section 3.1

Training scale and schedule reduce the effectiveness of MIAs.

Numbers: MIA performance spikes early in training, then decreases across checkpoints; multi-epoch training raises AUC (Figure 2).

Practical Use: Audits on fine-tuned or multi-epoch models will show more leakage than audits of single-epoch pretraining; test the exact training regime you care about.

Evidence Ref: Figure 2, Section 3.2.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence / Evidence Ref
Typical MIA AUC ROC (across domains) | < 0.6 (near-random) on most domains | Random = 0.5 | ≈ +0.05–0.10 vs random | PYTHIA-DEDUP / Pile domains (Table 1) | Table 1 shows most AUCs between ~0.49 and 0.56
GitHub domain AUC | ≈ 0.67–0.74 (best-performing domain) | Other domains ~0.5 | +0.15–0.25 vs typical domains | GitHub subset of Pile (Table 1, B.3) | Table 1 and discussion in Appendix B.3

What To Try In 7 Days

Run the MIMIR package against your model and domains to reproduce results quickly.

Compute n‑gram overlap between candidate non-members and your training corpus to check representativeness.

If auditing leakage, test on both single-epoch pretraining and any fine-tuning checkpoints you use in production.
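The n-gram overlap check above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a training corpus small enough to hold in memory; the function names and the default n are assumptions, not MIMIR's API.

```python
def ngram_set(tokens, n):
    """All distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate, training_docs, n=7):
    """Fraction of the candidate's n-grams that also occur in the training docs."""
    tokens = candidate.split()
    candidate_grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not candidate_grams:
        return 0.0  # candidate is shorter than n tokens
    training_grams = set()
    for doc in training_docs:
        training_grams |= ngram_set(doc.split(), n)
    hits = sum(1 for gram in candidate_grams if gram in training_grams)
    return hits / len(candidate_grams)


# Toy example with bigrams: 2 of the 3 candidate bigrams appear in training.
print(ngram_overlap("a b c d", ["a b c x"], n=2))
```

A candidate non-member with high overlap is only nominally "unseen", so a near-chance MIA result on it says little about leakage. For a real corpus, the in-memory set would need to be replaced by an index or a Bloom filter.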

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation focuses on black-box MIAs; white-box or stronger extraction attacks may behave differently.

Target models go up to 12B parameters and cover only a subset of available LLM families; trends may differ at larger scales.

When Not To Use

If you audit fine-tuned models or multi-epoch training, expect higher leakage than reported here.

For white-box threat models or strong extraction attacks, other methods may find more leakage.

Failure Modes

Temporal or topical distribution shifts between members and non-members cause inflated MIA scores (false positives about 'memorization').

High n‑gram overlap obscures real leakage and causes near-chance MIA performance.
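One rough way to screen for the temporal shift described above, before running any attack, is to check whether non-members mention dates the training data could not contain. The sketch below is a hypothetical heuristic and is not part of MIMIR or the paper; the regular expression, the helper names, and the cutoff year are all assumptions for illustration.

```python
import re

# Matches four-digit years from 1900 to 2099 as standalone tokens.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def latest_year(text):
    """Most recent year mentioned in the text, or None if no year appears."""
    years = [int(match.group(0)) for match in YEAR_PATTERN.finditer(text)]
    return max(years) if years else None


def fraction_after_cutoff(docs, cutoff_year):
    """Share of documents that mention a year later than the assumed cutoff."""
    if not docs:
        return 0.0
    flagged = 0
    for doc in docs:
        year = latest_year(doc)
        if year is not None and year > cutoff_year:
            flagged += 1
    return flagged / len(docs)


members = ["Released in 2019.", "No dates here."]
non_members = ["Updated March 2023 after the 2020 release.", "Event held in 2024."]
print(fraction_after_cutoff(members, cutoff_year=2020))
print(fraction_after_cutoff(non_members, cutoff_year=2020))
```

If the two fractions differ sharply, members and non-members can be told apart without querying the model at all, so a high MIA score on that benchmark should not be read as evidence of memorization.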

Core Entities

Models

PYTHIA, PYTHIA-DEDUP, GPT-NEO, SILO, DATABLATIONS, OLMO, STABLELM-BASE-ALPHA-3B-V2

Metrics

AUC ROC, TPR @ low FPR (e.g., 1% FPR)
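Both metrics can be computed directly from attack scores. The sketch below assumes the convention that a higher score means "more likely a member" (a loss-based score would be negated first); the function names are illustrative, and the quadratic-time AUC is meant for clarity rather than large benchmarks.

```python
def auc_roc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """Best true-positive rate among thresholds whose false-positive rate is <= max_fpr."""
    best = 0.0
    for threshold in set(member_scores) | set(nonmember_scores):
        fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best


# Toy scores for illustration only.
members = [0.9, 0.8, 0.4, 0.3]
non_members = [0.7, 0.5, 0.2, 0.1]
print(auc_roc(members, non_members))
print(tpr_at_fpr(members, non_members, max_fpr=0.0))
```

An AUC of 0.5 corresponds to random guessing, which is the baseline the Results table compares against. TPR at low FPR is stricter: it asks how many members an attacker can identify while making almost no false accusations.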

Datasets

The Pile, Pile-CC, Wikipedia, ArXiv, PubMed Central, GitHub, DM Mathematics, HackerNews, C4, RealTimeData WikiText

Benchmarks

MIMIR (membership inference benchmark package), Temporal non-member benchmarks, n-gram overlap thresholding benchmarks

Context Entities

Models

GPT-2, OPT, DISTILGPT2, LLAMA, StableLM