Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
10
Why It Matters For Business
Most standard membership inference tests will not show large privacy leakage for models pre-trained at scale, but careless benchmark choices (e.g., temporally shifted non-members) can falsely signal leakage.
Summary TLDR
The authors run a large-scale, reproducible evaluation of five membership inference attacks (MIAs) against language models trained on the Pile (PYTHIA suite, GPT‑NEO, SILO, etc.). Across most domains and model sizes (up to 12B), MIAs perform near random (AUC ≈ 0.5–0.6). Two main causes explain this: large pretraining corpora with near-one-epoch training reduce memorization, and high lexical overlap between training and candidate non-members makes membership ambiguous. When MIAs succeed, it is often because the non-member set is unintentionally shifted (e.g., temporally newer data) and thus easier to separate. The paper releases MIMIR, a unified benchmark and code for future audits.
Problem Statement
It remains unclear whether standard membership inference attacks can detect which texts were in an LLM's pretraining corpus. Prior work shows that MIAs work on classifiers and fine-tuned LMs, but not whether those results carry over to large-scale pretraining. This paper evaluates existing MIAs at scale and diagnoses why they often fail and why they occasionally succeed.
Main Contribution
Large-scale evaluation of five black-box MIAs across LLM families (PYTHIA, GPT-NEO, SILO) and Pile domains.
Finding: MIAs are near-random in most domains (AUC < 0.6) but can be high when non-members are distributionally shifted.
Diagnosis: two confounders, namely massive pretraining with ~1 epoch (which reduces memorization) and high n-gram overlap between members and non-members (which makes membership fuzzy).
Benchmarks and code release: MIMIR package and HuggingFace dataset for reproducible MIA evaluation.
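To make "black-box MIA" concrete, here is a minimal sketch of three score functions that need only per-token log-probabilities from the target model. The summary does not name the five attacks evaluated, so these are common illustrative examples rather than the paper's exact implementations. The function names, the default `k`, and the convention that a higher score means "more likely a member" are assumptions for this sketch.

```python
import zlib
from typing import Sequence


def loss_score(token_logprobs: Sequence[float]) -> float:
    """Mean token log-probability (negative loss); higher = more member-like."""
    return sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def zlib_score(text: str, token_logprobs: Sequence[float]) -> float:
    """Total log-likelihood normalized by the zlib-compressed size of the text.

    The compressed size acts as a rough, model-free proxy for how
    predictable the text is on its own.
    """
    compressed_len = len(zlib.compress(text.encode("utf-8")))
    return sum(token_logprobs) / compressed_len
```

Obtaining `token_logprobs` from a real model is left out on purpose; any API that returns per-token log-likelihoods for a given text would do.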
Key Findings
Existing MIAs mostly fail against pre-trained LLMs.
Training scale and schedule reduce the effectiveness of MIAs.
High lexical overlap between members and non-members makes membership ambiguous.
Selecting temporally shifted non-members inflates MIA performance.
Small lexical or semantic edits convert 'members' into confidently classified non-members.
Results
Typical MIA AUC ROC (across domains)
GitHub domain AUC
AUC under temporal shift
7‑gram overlap (average)
Who Should Care
What To Try In 7 Days
Run the MIMIR package against your model and domains to reproduce results quickly.
Compute n‑gram overlap between candidate non-members and your training corpus to check representativeness.
If auditing leakage, test both single-epoch pretrained checkpoints and any fine-tuned checkpoints you use in production.
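The n-gram overlap check above can be sketched in a few lines. This is a minimal illustration, not the MIMIR implementation: it assumes whitespace tokenization (a real audit would more likely use the model's tokenizer) and a training corpus small enough to index in memory. All function names here are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs anywhere in the training corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        index.update(ngrams(doc.split(), n))
    return index


def ngram_overlap(candidate: str, index: Set[NGram], n: int) -> float:
    """Fraction of the candidate's n-grams that also occur in the corpus."""
    grams = ngrams(candidate.split(), n)
    if not grams:
        return 0.0  # candidate shorter than n tokens
    return sum(1 for g in grams if g in index) / len(grams)


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    index = build_ngram_index(train, n=3)
    print(ngram_overlap("the quick brown fox sleeps", index, n=3))
```

A candidate non-member set whose average overlap is far below that of held-out data from the same source is a hint that the set is distributionally shifted.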
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation focuses on black-box MIAs; white-box or stronger extraction attacks may behave differently.
- Target models go up to 12B parameters and cover only a subset of available LLM families; trends may differ at larger scales.
- High-overlap domains (GitHub) and decontamination choices can produce outlier results.
- The authors could not fully reproduce all prior-work pipelines and exact non-member choices, which limits direct comparisons.
When Not To Use
- When auditing fine-tuned models or multi-epoch training: expect higher leakage than reported here.
- When the threat model is white-box or involves strong extraction attacks: other methods may find more leakage.
- When you want to conclude that no privacy risk exists because AUC ≈ 0.5: extractability and PII tests need separate analysis.
Failure Modes
- Temporal or topical distribution shifts between members and non-members cause inflated MIA scores (false positives about 'memorization').
- High n‑gram overlap obscures real leakage and causes near-chance MIA performance.
- Small edits or paraphrases of true members are often classified as non-members, causing false negatives relative to practical leakage.
Core Entities
Models
- PYTHIA
- PYTHIA-DEDUP
- GPT-NEO
- SILO
- DATABLATIONS
- OLMO
- STABLELM-BASE-ALPHA-3B-V2
Metrics
- AUC ROC
- TPR at low FPR (e.g., 1% FPR)
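Both metrics can be computed directly from attack scores. The sketch below is dependency-free and assumes that a higher score means "more likely a member"; the function names are illustrative. AUC is computed as the probability that a random member outscores a random non-member (ties count as one half), which is equivalent to the area under the ROC curve.

```python
from typing import Sequence


def auc_roc(member_scores: Sequence[float],
            nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member scores above a random non-member."""
    wins = 0.0
    for m in member_scores:
        for x in nonmember_scores:
            if m > x:
                wins += 1.0
            elif m == x:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               max_fpr: float = 0.01) -> float:
    """Best true-positive rate among thresholds whose FPR is <= max_fpr.

    A sample is predicted 'member' when its score is >= the threshold.
    """
    best_tpr = 0.0
    for threshold in set(member_scores) | set(nonmember_scores):
        fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
            best_tpr = max(best_tpr, tpr)
    return best_tpr
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a quick audit; for large evaluation sets a rank-based or library implementation would be the better choice.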
Datasets
- The Pile
- Pile-CC
- Wikipedia
- ArXiv
- PubMed Central
- GitHub
- DM Mathematics
- HackerNews
- C4
- RealTimeData WikiText
Benchmarks
- MIMIR (membership inference benchmark package)
- Temporal non-member benchmarks
- n-gram overlap thresholding benchmarks
Context Entities
Models
- GPT-2
- OPT
- DISTILGPT2
- LLAMA
- StableLM

