Overview
Thorough, reproducible experiments support conclusions, but results are scoped to black-box MIAs and evaluated model families up to 12B parameters.
Citations: 10
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Most standard membership inference tests will not show large privacy leakage for models pre-trained at scale, but careless benchmark choices (e.g., temporally shifted non-members) can falsely signal leakage.
Summary TLDR
The authors run a large-scale, reproducible evaluation of five membership inference attacks (MIAs) against language models trained on the Pile (PYTHIA suite, GPT‑NEO, SILO, etc.). Across most domains and model sizes (up to 12B), MIAs perform near random (AUC ≈ 0.5–0.6). Two main causes explain this: large pretraining corpora with near-one-epoch training reduce memorization, and high lexical overlap between training and candidate non-members makes membership ambiguous. When MIAs succeed, it is often because the non-member set is unintentionally shifted (e.g., temporally newer data) and thus easier to separate. The paper releases MIMIR, a unified benchmark and code for future audits.
Problem Statement
There is no clear answer to whether standard membership inference attacks can detect which texts were in an LLM's pretraining corpus. Prior work shows that MIAs work on classifiers and fine-tuned LMs, but not whether those results carry over to large-scale pretraining. This paper evaluates existing MIAs at scale and diagnoses why they often fail and why they occasionally succeed.
Main Contribution
Large-scale evaluation of five black-box MIAs across LLM families (PYTHIA, GPT-NEO, SILO) and Pile domains.
Finding: MIAs perform near-random in most domains (AUC < 0.6), but their AUC can be high when non-members are distributionally shifted.
Key Findings
Existing MIAs mostly fail against pre-trained LLMs.
Large pretraining corpora and near-one-epoch training schedules reduce memorization and, with it, the effectiveness of MIAs.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Typical MIA AUC ROC (across domains) | < 0.6 (near-random) on most domains | Random = 0.5 | ≈ −0.01 to +0.06 vs random | PYTHIA-DEDUP / Pile domains (Table 1) | Table 1 shows most AUCs between ~0.49 and 0.56 | — |
| GitHub domain AUC | ≈ 0.67–0.74 (best-performing domain) | Other domains ~0.5 | +0.15–0.25 vs typical domains | GitHub subset of Pile (Table 1, B.3) | Table 1 and discussion in Appendix B.3 | — |
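The AUC ROC and delta columns above can be read as follows: AUC is the probability that an attack scores a randomly chosen member higher than a randomly chosen non-member, and the delta is that value minus 0.5. The sketch below is a minimal illustration of that definition, not the MIMIR implementation; the function name `auc_roc` and the example scores are hypothetical.

```python
from typing import Sequence


def auc_roc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. Scores are assumed to follow the convention
    "higher means more likely to be a member".
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores for illustration only; not data from the paper.
members = [0.9, 0.4, 0.6, 0.3]
nonmembers = [0.5, 0.2, 0.7, 0.35]
auc = auc_roc(members, nonmembers)
print(f"AUC = {auc:.3f}, delta vs random = {auc - 0.5:+.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for a quick check; for large evaluation sets a rank-based or library implementation would be the more practical choice.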
What To Try In 7 Days
Run the MIMIR package against your model and domains to reproduce results quickly.
Compute n‑gram overlap between candidate non-members and your training corpus to check representativeness.
If auditing leakage, test on both single-epoch pretraining and any fine-tuning checkpoints you use in production.
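The n-gram overlap check in the list above could be sketched as below. This is a simplified illustration under stated assumptions: whitespace tokenization, lowercasing, and an in-memory set index are choices made here for brevity, and the helper names (`ngrams`, `build_index`, `overlap_fraction`) are hypothetical rather than part of MIMIR. A real pretraining corpus would need a scalable index instead of a Python set.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> List[NGram]:
    """Lowercased whitespace-token n-grams (tokenization is an assumption)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs anywhere in the training documents."""
    index: Set[NGram] = set()
    for doc in corpus:
        index.update(ngrams(doc, n))
    return index


def overlap_fraction(candidate: str, index: Set[NGram], n: int) -> float:
    """Fraction of the candidate's n-grams that also occur in the index."""
    grams = ngrams(candidate, n)
    if not grams:
        return 0.0  # candidate is shorter than n tokens
    return sum(g in index for g in grams) / len(grams)


# Toy example; the documents are invented for illustration.
training_docs = ["the quick brown fox jumps over the lazy dog"]
candidate = "the quick brown cat jumps over the lazy dog"
index = build_index(training_docs, n=3)
print(f"{overlap_fraction(candidate, index, n=3):.3f}")
```

A candidate non-member with a high overlap fraction is hard to call a true non-member, so reporting the distribution of this value across the non-member set gives a quick read on how ambiguous the benchmark is.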
Risks & Boundaries
Limitations
Evaluation focuses on black-box MIAs; white-box or stronger extraction attacks may behave differently.
Target models go up to 12B parameters and cover only a subset of available LLM families; trends may differ at larger scales.
When Not To Use
Auditing fine-tuned models or multi-epoch training: expect higher leakage than reported here.
White-box threat models or strong extraction attacks: other methods may find more leakage.
Failure Modes
Temporal or topical distribution shifts between members and non-members cause inflated MIA scores (false positives about 'memorization').
High n‑gram overlap obscures real leakage and causes near-chance MIA performance.
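One way to screen for the first failure mode is a model-free check: if a feature that never touches the target model (for example, a document's publication year) already separates members from non-members, a high MIA score on that benchmark may reflect the shift rather than memorization. The sketch below assumes such metadata is available; the function name `best_threshold_accuracy` and the example years are hypothetical.

```python
from typing import Sequence, Tuple


def best_threshold_accuracy(
    member_vals: Sequence[float], nonmember_vals: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, accuracy) of the best rule "value >= t means non-member".

    The flipped rule is also considered, so the accuracy is never below 0.5.
    """
    if not member_vals or not nonmember_vals:
        raise ValueError("both groups must be non-empty")
    total = len(member_vals) + len(nonmember_vals)
    best_t, best_acc = float("nan"), 0.0
    for t in sorted(set(member_vals) | set(nonmember_vals)):
        correct = sum(v < t for v in member_vals) + sum(v >= t for v in nonmember_vals)
        acc = max(correct, total - correct) / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical publication years; not data from the paper.
member_years = [2018, 2019, 2019, 2020]
nonmember_years = [2021, 2021, 2022, 2020]
threshold, accuracy = best_threshold_accuracy(member_years, nonmember_years)
print(f"best threshold = {threshold}, accuracy = {accuracy:.3f}")
```

An accuracy well above 0.5 from metadata alone suggests the non-member set is shifted and should be rebuilt before any leakage conclusion is drawn.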

