Membership inference mostly fails on pretrained LLMs; apparent successes often come from dataset shifts

Overview

Decision Snapshot: Needs Validation

Thorough, reproducible experiments support the conclusions, but the results are limited to black-box MIAs and to the evaluated model families (up to 12B parameters).

Citations: 10

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, Hannaneh Hajishirzi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Most standard membership inference tests will not show large privacy leakage for models pre-trained at scale, but careless benchmark choices (e.g., temporally shifted non-members) can falsely signal leakage.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

The authors run a large-scale, reproducible evaluation of five membership inference attacks (MIAs) against language models trained on the Pile (PYTHIA suite, GPT‑NEO, SILO, etc.). Across most domains and model sizes (up to 12B), MIAs perform near random (AUC ≈ 0.5–0.6). Two main causes explain this: large pretraining corpora with near-one-epoch training reduce memorization, and high lexical overlap between training and candidate non-members makes membership ambiguous. When MIAs succeed, it is often because the non-member set is unintentionally shifted (e.g., temporally newer data) and thus easier to separate. The paper releases MIMIR, a unified benchmark and code for future audits.

Problem Statement

It is unclear whether standard membership inference attacks can detect which texts were in an LLM's pretraining corpus. Prior work shows that MIAs work on classifiers and fine-tuned LMs, but not whether those results carry over to large-scale pretraining. This paper evaluates existing MIAs at scale and diagnoses why they often fail and why they occasionally succeed.

Main Contribution

Large-scale evaluation of five black-box MIAs across LLM families (PYTHIA, GPT-NEO, SILO) and Pile domains.

Finding: MIAs are near-random in most domains (AUC < 0.6) but can be high when non-members are distributionally shifted.

Key Findings

Existing MIAs mostly fail against pre-trained LLMs.

Numbers: Most AUC ROC < 0.6 across domains (Table 1).

Practical Use: Don't assume a high MIA score means broad memorization of pretraining data; run domain tests before drawing privacy conclusions.

Evidence Ref: Table 1, Section 3.1
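The AUC ROC values quoted in this finding can be read as a ranking probability: the chance that a randomly chosen member receives a higher attack score than a randomly chosen non-member. A minimal sketch in Python follows, assuming each attack emits one score per text and that a higher score means "more likely a member" (for instance, a negated loss). The function name and the toy scores are illustrative and are not taken from the paper or from MIMIR.

```python
def auc_roc(member_scores, nonmember_scores):
    """AUC ROC as the fraction of (member, non-member) pairs ranked correctly.

    Ties count as half a correct pair. This is the pairwise (Mann-Whitney)
    form of AUC; it is quadratic in the number of samples, which is fine
    for a sanity check but slow for large audits.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    correct = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                correct += 1.0
            elif m == n:
                correct += 0.5
    return correct / (len(member_scores) * len(nonmember_scores))


# Toy scores (made up for illustration only).
toy_members = [0.6, 0.4]
toy_nonmembers = [0.5, 0.3]
toy_auc = auc_roc(toy_members, toy_nonmembers)  # 3 of 4 pairs ranked correctly
```

A value close to 0.5 means the attack ranks members no better than chance, which is the regime reported for most Pile domains.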

Training scale and schedule reduce the effectiveness of MIAs.

Numbers: MIA performance spikes early in training, then decreases across checkpoints; multi-epoch training raises AUC (Figure 2).

Practical Use: Audits on fine-tuned or multi-epoch models will show more leakage than audits of single-epoch pretraining; test the exact training regime you care about.

Evidence Ref: Figure 2, Section 3.2.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Typical MIA AUC ROC (across domains)	< 0.6 (near-random) on most domains	Random = 0.5	≈ +0.05–0.10 vs random	PYTHIA-DEDUP / Pile domains (Table 1)	Table 1 shows most AUCs between ~0.49 and 0.56	—
GitHub domain AUC	≈ 0.67–0.74 (best-performing domain)	Other domains ~0.5	+0.15–0.25 vs typical domains	GitHub subset of Pile (Table 1, B.3)	Table 1 and discussion in Appendix B.3	—

What To Try In 7 Days

Run the MIMIR package against your model and domains to reproduce results quickly.

Compute n‑gram overlap between candidate non-members and your training corpus to check representativeness.

If auditing leakage, test on both single-epoch pretraining and any fine-tuning checkpoints you use in production.
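The n-gram overlap check suggested above can be sketched as follows. This is a rough illustration under stated assumptions: whitespace tokenization, an in-memory set of corpus n-grams, and a default of n=7 chosen here for illustration. The function names are hypothetical and do not come from the MIMIR package. A real pretraining corpus would need a disk-backed index or a Bloom filter rather than a Python set.

```python
def ngram_set(tokens, n):
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate, corpus_docs, n=7):
    """Fraction of the candidate's n-grams that also occur in the corpus.

    candidate:   text of one candidate non-member
    corpus_docs: iterable of training documents (strings)
    Returns 0.0 when the candidate is shorter than n tokens.
    """
    cand_tokens = candidate.split()  # assumption: whitespace tokenization
    total = len(cand_tokens) - n + 1
    if total <= 0:
        return 0.0
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngram_set(doc.split(), n)
    hits = sum(
        1 for i in range(total)
        if tuple(cand_tokens[i:i + n]) in corpus_ngrams
    )
    return hits / total


# Toy example with bigrams: 3 of the candidate's 5 bigrams occur in the corpus.
toy_overlap = ngram_overlap(
    "the cat sat on a mat", ["the cat sat on the mat"], n=2
)
```

Candidate non-members with very high overlap are hard to call "non-members" in any meaningful sense, while a set with unusually low overlap may signal a distribution shift that makes the attack look stronger than it is.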

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

http://github.com/iamgroot42/mimir

Data URLs

https://huggingface.co/datasets/iamgroot42/mimir

Risks & Boundaries

Limitations

Evaluation focuses on black-box MIAs; white-box or stronger extraction attacks may behave differently.

Target models go up to 12B parameters and cover only a subset of available LLM families; trends may differ at larger scales.

When Not To Use

If you audit fine-tuned models or multi-epoch training, expect higher leakage than reported here.

For white-box threat models or strong extraction attacks, other methods may find more leakage.

Failure Modes

Temporal or topical distribution shifts between members and non-members cause inflated MIA scores (false positives about 'memorization').

High n‑gram overlap obscures real leakage and causes near-chance MIA performance.

Core Entities

Models

PYTHIA, PYTHIA-DEDUP, GPT-NEO, SILO, DATABLATIONS, OLMO, STABLELM-BASE-ALPHA-3B-V2

Metrics

AUC ROC, TPR @ low% FPR (e.g., 1% FPR)
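TPR at a low FPR asks how many members the attack recovers when it is allowed to mislabel only a small fraction of non-members. A minimal sketch, assuming higher scores mean "more likely a member" and a strict greater-than decision rule; the function name and the toy scores are illustrative, not from the paper or its code.

```python
import math


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """True-positive rate at the loosest threshold whose FPR is <= max_fpr.

    A text is predicted "member" when its score is strictly greater than
    the threshold, so tied non-member scores never exceed the FPR budget.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of false positives we may tolerate (small epsilon guards
    # against floating-point results such as 0.999999 instead of 1.0).
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        threshold = float("-inf")  # every non-member may be flagged
    else:
        threshold = ranked[allowed]
    true_positives = sum(1 for s in member_scores if s > threshold)
    return true_positives / len(member_scores)


# Toy scores (made up): with 10 non-members and max_fpr=0.1, exactly one
# false positive is allowed, so the threshold is the second-highest
# non-member score (0.8).
toy_nonmembers = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_members = [0.95, 0.85, 0.5, 0.1]
toy_tpr = tpr_at_fpr(toy_members, toy_nonmembers, max_fpr=0.1)
```

This metric complements AUC ROC: an attack can have near-chance AUC overall and still identify a few members with high confidence, or the reverse.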

Datasets

The Pile, Pile-CC, Wikipedia, ArXiv, PubMed Central, GitHub, DM Mathematics, HackerNews, C4, RealTimeData WikiText

Benchmarks

MIMIR (membership inference benchmark package), Temporal non-member benchmarks, n-gram overlap thresholding benchmarks

Context Entities

Models

GPT-2, OPT, DISTILGPT2, LLAMA, StableLM
