Adaptive system that detects and masks personal data to meet GDPR and CCPA rules

January 21, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

Benchmarks and a small human study show strong detection for several PII types, but key evaluations use an internal dataset and limited human tests, so broader generalization needs more validation.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 45%

Authors

Shubhi Asthana, Ruchi Mahindru, Bing Zhang, Jorge Sanz

Why It Matters For Business

Automated, policy-aware PII detection reduces legal risk and audit effort while preserving data utility for ML pipelines.

Summary TLDR

This paper presents OneShield, an adaptive pipeline that detects personally identifiable information (PII) in text with context-aware scoring and applies regulation-aware masking strategies. Benchmarks on an in-house set (~1,500 points) and a Kaggle PII dataset show strong F1 scores (e.g., passport numbers 0.95–1.0). A 20-person study rated perceived protection 4.6/5. The system integrates a policy engine to map laws (GDPR/CCPA) into actionable masking rules and logs actions for audits.

Problem Statement

LLMs consume large public text corpora, but the legal rules that govern personal data (GDPR, CCPA, PIPEDA) vary by jurisdiction. Static redaction and pattern rules either miss sensitive cases or remove useful context. Enterprises need a scalable, updatable system that detects PII in context and applies jurisdiction-specific remediation without degrading downstream model utility.

Main Contribution

Adaptive Risk Mitigation Framework: a policy-driven system that converts laws into executable masking rules.

Contextual PII Detection: multi-step detector that scores entity sensitivity using local semantics and metadata.
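The two contributions above can be illustrated with a minimal sketch. It is not OneShield's implementation: the policy table, the cue words, the scores, and the threshold are all hypothetical, and only email addresses are detected. It shows the general pattern of scoring an entity from nearby text and then looking up a jurisdiction-specific masking action.

```python
import re
from dataclasses import dataclass

# Hypothetical policy table: (jurisdiction, entity type) -> masking action.
# Illustrative placeholders only, not legal guidance.
POLICY = {
    ("GDPR", "EMAIL"): "redact",
    ("CCPA", "EMAIL"): "partial",
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumed cue words that raise sensitivity when they appear near a match.
CONTEXT_CUES = ("contact", "personal", "private", "home")


@dataclass
class Entity:
    kind: str
    start: int
    end: int
    text: str
    score: float  # assumed sensitivity score in [0, 1]


def detect(text, window=30):
    """Find emails and score them from the surrounding characters."""
    entities = []
    for m in EMAIL_RE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        context = (before + " " + after).lower()
        score = 0.9 if any(cue in context for cue in CONTEXT_CUES) else 0.5
        entities.append(Entity("EMAIL", m.start(), m.end(), m.group(), score))
    return entities


def apply_action(action, ent):
    """Turn a policy action into replacement text."""
    if action == "redact":
        return f"[{ent.kind}]"
    if action == "partial":
        local, _, domain = ent.text.partition("@")
        return f"{local[0]}***@{domain}"
    return ent.text  # "keep"


def mask(text, jurisdiction, threshold=0.6):
    """Mask entities at or above the threshold using the policy table."""
    pieces, cursor = [], 0
    for ent in detect(text):
        # Unknown (jurisdiction, type) pairs fall back to full redaction.
        action = POLICY.get((jurisdiction, ent.kind), "redact")
        if ent.score < threshold:
            action = "keep"
        pieces.append(text[cursor:ent.start])
        pieces.append(apply_action(action, ent))
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Please contact ana@example.com for details"
print(mask(sample, "GDPR"))  # Please contact [EMAIL] for details
print(mask(sample, "CCPA"))  # Please contact a***@example.com for details
```

Keeping the policy in a plain lookup table means a rule change is a data update rather than a code change, which is one plausible way to get the "updatable" property the paper asks for.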

Key Findings

Passport number detection outperforms other tools on evaluated benchmarks.

Numbers: OneShield F1=0.95 (Bench1); Presidio 0.33; Comprehend 0.54

Practical Use: Use the OneShield detector where accurate passport masking matters; it reduces leakage risk compared to common open tools on these datasets.

Evidence Ref: Table 2 (PassportNumber row, Benchmark1)

Person name detection is near-perfect on evaluated data.

Numbers: OneShield Person F1=1.00 (Bench1); StarPII 0.99; Comprehend 0.88

Practical Use: Expect very few missed names on similar text types; you can rely on it to mask or pseudonymize names automatically with little manual review.

Evidence Ref: Table 2 (Person row, Benchmark1)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| PassportNumber F1 (Benchmark1) | 0.95 | Presidio 0.33; Comprehend 0.54 | OneShield +0.62 vs Presidio | Benchmark1 (in-house) | Table 2 passport row | Table 2 |
| Person name F1 (Benchmark1) | 1.00 | StarPII 0.99; Comprehend 0.88 | Comparable to best open-source NER | Benchmark1 (in-house) | Table 2 person row | Table 2 |
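To make the delta column concrete, the sketch below recomputes the passport-number gaps from the reported F1 values and shows how F1 is derived from true positives, false positives, and false negatives. The counts passed to `f1_score` are hypothetical and chosen only to yield 0.95; the paper's actual confusion counts are not given here.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported PassportNumber F1 values on Benchmark1 (from the Results rows above).
reported = {"OneShield": 0.95, "Presidio": 0.33, "Comprehend": 0.54}

# Gap between OneShield and each baseline, rounded to two decimals.
deltas = {
    name: round(reported["OneShield"] - f1, 2)
    for name, f1 in reported.items()
    if name != "OneShield"
}
print(deltas)  # {'Presidio': 0.62, 'Comprehend': 0.41}

# Hypothetical counts: 19 correct detections, 1 spurious, 1 missed.
print(round(f1_score(tp=19, fp=1, fn=1), 2))  # 0.95
```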

What To Try In 7 Days

Run OneShield or a contextual PII detector over a small training slice and compare F1 on known PII

Map your GDPR/CCPA must-rules into a policy table and test masking behaviors

Enable audit logging for remediation actions for one model pipeline and review outputs
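For the audit-logging step, a minimal sketch might write one JSON line per remediation action. The field names, the `policy_id` value, and the keyed-hash choice are assumptions for illustration, not the paper's log format. The record stores a keyed hash of the masked value rather than the value itself, so the log does not become a second copy of the PII.

```python
import hashlib
import hmac
import io
import json
from datetime import datetime, timezone


def audit_record(entity_type, action, original, policy_id, pipeline, key):
    """Build one audit entry; the raw value is never stored."""
    digest = hmac.new(key, original.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pipeline": pipeline,
        "policy_id": policy_id,      # hypothetical rule identifier
        "entity_type": entity_type,
        "action": action,
        "value_hmac_sha256": digest,
    }


def write_audit(stream, record):
    """Append the record as one JSON line to any text stream."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


# Usage: an in-memory stream stands in for a log file.
log = io.StringIO()
rec = audit_record(
    entity_type="EMAIL",
    action="redact",
    original="ana@example.com",
    policy_id="gdpr-email-01",
    pipeline="demo-pipeline",
    key=b"replace-with-a-managed-secret",
)
write_audit(log, rec)
print(log.getvalue().strip())
```

Reviewing the output then amounts to reading the log back line by line with `json.loads` and checking that every masked entity has a matching entry.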

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation relies on an internal in-house dataset (~1,500 points) and one public Kaggle set; generalization to other domains is unproven

Human trust study is small (n=20) and subjective

When Not To Use

When legal audit requires fully auditable, certified third-party tools without internal customization

Where latency must be minimal and complex contextual scoring cannot be afforded

Failure Modes

False negatives on nested or obfuscated PII (e.g., email inside URLs) as noted for other tools

False positives on public organizations or names without contextual disambiguation
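The nested-PII failure mode can be illustrated with percent-encoded emails in URLs. The sketch below is an assumed mitigation, not something the paper describes: it decodes the text a bounded number of times and rescans after each pass, so an address hidden as `ana%40example.com` is still found. The regex is deliberately simple and will not cover every valid address.

```python
import re
from urllib.parse import unquote

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_emails(text, max_rounds=3):
    """Scan raw text, then rescan after each round of percent-decoding."""
    found = set(EMAIL_RE.findall(text))
    current = text
    for _ in range(max_rounds):
        decoded = unquote(current)
        if decoded == current:
            break  # nothing left to decode
        current = decoded
        found.update(EMAIL_RE.findall(current))
    return sorted(found)


url = "https://example.com/reset?user=ana%40example.com&lang=en"
print(EMAIL_RE.findall(url))  # [] (a plain scan misses the encoded address)
print(find_emails(url))       # ['ana@example.com']
```

The round limit bounds the work on doubly encoded input; other obfuscations such as "name at domain dot com" would need separate handling.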

Core Entities

Models

OneShield PII detector

StarPII

Metrics

F1

User trust score

Datasets

Benchmark1 (in-house, ~1500 examples)

Kaggle PII detection dataset (pii-detection-dataset-gpt)

Benchmarks

Benchmark1 in-house

Benchmark2 Kaggle PII dataset