Adaptive system that detects and masks personal data to meet GDPR and CCPA rules

January 21, 2025 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

1

Authors

Shubhi Asthana, Ruchi Mahindru, Bing Zhang, Jorge Sanz

Why It Matters For Business

Automated, policy-aware PII detection reduces legal risk and audit effort while preserving data utility for ML pipelines.

Summary TLDR

This paper presents OneShield, an adaptive pipeline that detects personally identifiable information (PII) in text with context-aware scoring and applies regulation-aware masking strategies. Benchmarks on an in-house set (~1,500 points) and a Kaggle PII dataset show strong F1 scores (e.g., passport numbers 0.95–1.0). A 20-person study rated perceived protection 4.6/5. The system integrates a policy engine to map laws (GDPR/CCPA) into actionable masking rules and logs actions for audits.
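As a rough illustration of what context-aware scoring can mean in practice, the sketch below scores a passport-like token by the words around it. It is not OneShield's detector: the candidate pattern, the cue-word lists, the window size, and the score weights are all hypothetical values chosen for the example.

```python
import re

# Hypothetical pattern: one capital letter followed by 8 digits (loosely passport-like).
CANDIDATE = re.compile(r"\b[A-Z]\d{8}\b")

# Hypothetical cue words that raise or lower the sensitivity score.
BOOST = {"passport", "travel", "visa"}
DAMPEN = {"order", "invoice", "tracking", "sku"}


def score_candidates(text, window=40, base=0.5, step=0.4):
    """Return (match, score) pairs, adjusting each score by nearby cue words."""
    results = []
    for m in CANDIDATE.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        # Lowercased words found within `window` characters of the match.
        context = set(re.findall(r"[a-z]+", text[start:end].lower()))
        score = base
        if context & BOOST:
            score += step
        if context & DAMPEN:
            score -= step
        # Clamp to [0, 1] and round for readability.
        results.append((m.group(), round(min(1.0, max(0.0, score)), 2)))
    return results


sample_a = "Her passport number is X12345678 and expires in May."
sample_b = "Please quote order reference X12345678 when you call."
print(score_candidates(sample_a))  # same token, high score
print(score_candidates(sample_b))  # same token, low score
```

The same string receives a different score in each sentence, which is the behavior a pure pattern rule cannot provide.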

Problem Statement

LLMs consume large public text corpora, but the legal rules that govern personal data (GDPR, CCPA, PIPEDA) vary by jurisdiction. Static redaction or pattern rules either miss sensitive cases or remove useful context. Enterprises need a scalable, updatable system that detects PII in context and applies jurisdiction-specific remediation without degrading downstream model utility.

Main Contribution

Adaptive Risk Mitigation Framework: a policy-driven system that converts laws into executable masking rules.

Contextual PII Detection: multi-step detector that scores entity sensitivity using local semantics and metadata.

Adaptive Masking: regulation-aware remediation (pseudonymize, hash, obfuscate) to balance privacy and data utility.

Enterprise integration: deployment inside OneShield Guardrails with audit logs and rule templates.
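The contributions above can be pictured with a small Python sketch of a policy table that drives masking and records an audit entry per action. This is an assumption-laden illustration, not the OneShield implementation: the `POLICY` entries, the `mask` and `apply_action` helpers, and the audit record fields are all hypothetical, and the mappings are not legal guidance.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical policy table: (jurisdiction, entity type) -> masking action.
POLICY = {
    ("GDPR", "EMAIL"): "hash",
    ("GDPR", "PERSON"): "pseudonymize",
    ("CCPA", "EMAIL"): "obfuscate",
    ("CCPA", "PERSON"): "pseudonymize",
}
DEFAULT_ACTION = "obfuscate"  # fallback when no rule matches


def apply_action(action, value, entity_type, pseudonyms):
    if action == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    if action == "pseudonymize":
        # Reuse one placeholder per distinct value so co-reference survives.
        if value not in pseudonyms:
            pseudonyms[value] = f"{entity_type}_{len(pseudonyms) + 1}"
        return pseudonyms[value]
    return "*" * len(value)  # obfuscate


def mask(text, entities, jurisdiction):
    """entities: non-overlapping (start, end, entity_type) spans."""
    pseudonyms, audit = {}, []
    # Replace from the end of the text so earlier offsets stay valid.
    for start, end, etype in sorted(entities, reverse=True):
        action = POLICY.get((jurisdiction, etype), DEFAULT_ACTION)
        replacement = apply_action(action, text[start:end], etype, pseudonyms)
        text = text[:start] + replacement + text[end:]
        # The audit record stores what was done, never the raw value.
        audit.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "jurisdiction": jurisdiction,
            "entity_type": etype,
            "action": action,
        })
    return text, audit


text = "Contact Ana Ruiz at ana@example.com."
spans = [(8, 16, "PERSON"), (20, 35, "EMAIL")]
masked, log = mask(text, spans, "CCPA")
print(masked)
print(json.dumps(log[0], indent=2))
```

Keeping the rules in a plain table means a policy change is a data edit, not a code change, which is one plausible reading of "converts laws into executable masking rules".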

Key Findings

Passport number detection outperforms other tools on evaluated benchmarks.

Numbers: OneShield F1=0.95 (Bench1); Presidio 0.33; Comprehend 0.54

Person name detection is near-perfect on evaluated data.

Numbers: OneShield Person F1=1.00 (Bench1); StarPII 0.99; Comprehend 0.88

Users rated perceived privacy protection high in a small human study.

Numbers: User trust score = 4.6 / 5 (n=20)

Results

PassportNumber F1 (Benchmark1)

Value: 0.95

Baseline: Presidio 0.33; Comprehend 0.54

Person name F1 (Benchmark1)

Value: 1.00

Baseline: StarPII 0.99; Comprehend 0.88

Date F1 (Benchmark1)

Value: 0.94

Baseline: Presidio 0.62; Comprehend 0.76

User trust (human study)

Value: 4.6 / 5

Who Should Care

What To Try In 7 Days

Run OneShield or a contextual PII detector over a small training slice and compare F1 on known PII

Map your mandatory GDPR/CCPA rules into a policy table and test masking behaviors

Enable audit logging for remediation actions for one model pipeline and review outputs
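For the first step, a minimal Python sketch of an F1 comparison on a labeled slice is shown here. It assumes exact-match scoring on `(doc_id, start, end, entity_type)` spans; the paper summary does not say which matching scheme its F1 numbers use, and the gold labels and detector outputs are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scoring over sets of (doc_id, start, end, entity_type)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that match on every field
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented gold annotations for three documents.
gold = {
    (1, 8, 16, "PERSON"),
    (1, 20, 35, "EMAIL"),
    (2, 0, 9, "PASSPORT"),
    (3, 5, 14, "PASSPORT"),
}
# Invented outputs from two hypothetical detectors.
detector_a = {(1, 8, 16, "PERSON"), (1, 20, 35, "EMAIL"), (2, 0, 9, "PASSPORT")}
detector_b = {
    (1, 8, 16, "PERSON"),
    (2, 0, 9, "PASSPORT"),
    (2, 12, 20, "PASSPORT"),  # false positive
    (3, 5, 13, "PASSPORT"),   # boundary error counts as a miss under exact match
}

for name, pred in [("detector_a", detector_a), ("detector_b", detector_b)]:
    p, r, f1 = precision_recall_f1(gold, pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact match is strict about span boundaries; a token-level or overlap-based score would treat the off-by-one span in `detector_b` more leniently, so the scheme should be fixed before comparing tools.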

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation relies on an internal in-house dataset (~1,500 points) and one public Kaggle set; generalization to other domains is unproven
  • Human trust study is small (n=20) and subjective
  • Edge cases such as public-figure exemptions and cross-jurisdiction conflicts need better automated resolution
  • Implementation details and code are not provided, limiting reproducibility

When Not To Use

  • When legal audit requires fully auditable, certified third-party tools without internal customization
  • Where latency must be minimal and complex contextual scoring cannot be afforded
  • If you need open-source code or reproducible research artifacts from the paper

Failure Modes

  • False negatives on nested or obfuscated PII (e.g., email inside URLs) as noted for other tools
  • False positives on public organizations or names without contextual disambiguation
  • Conflicting jurisdiction rules causing inconsistent masking across deployments
  • Dependency on domain experts to tune policy mappings may create misconfigurations
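The nested-PII failure mode is easy to reproduce. The Python sketch below shows an email address that a plain regex misses because it is percent-encoded inside a URL, and one possible mitigation: decoding the text before matching. The regex and the `find_emails` helper are hypothetical and deliberately simple; nothing here is taken from OneShield.

```python
import re
from urllib.parse import unquote

# Simplified email pattern for illustration; real addresses are more varied.
EMAIL = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text, max_decode_rounds=2):
    """Search the raw text, then up to N percent-decoded versions of it."""
    found = set(EMAIL.findall(text))
    decoded = text
    for _ in range(max_decode_rounds):
        nxt = unquote(decoded)
        if nxt == decoded:  # nothing left to decode
            break
        decoded = nxt
        found.update(EMAIL.findall(decoded))
    return sorted(found)


url = "https://example.com/reset?user=ana%40example.com&next=%2Fhome"
print(EMAIL.findall(url))  # the raw scan finds nothing: "@" is encoded as %40
print(find_emails(url))    # the decoded scan recovers the address
```

Decoding a bounded number of rounds also covers double-encoded values while avoiding unbounded loops on adversarial input.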

Core Entities

Models

  • OneShield PII detector
  • StarPII

Metrics

  • F1
  • User trust score

Datasets

  • Benchmark1 (in-house, ~1,500 examples)
  • Kaggle PII detection dataset (pii-detection-dataset-gpt)

Benchmarks

  • Benchmark1 in-house
  • Benchmark2 Kaggle PII dataset