Systematic test shows current detectors fail to reliably spot ChatGPT text

April 4, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper compiles a benchmark and an evaluation but releases neither code nor a fully public dataset; results show low detector effectiveness on the authors' test set.

Citations: 36

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 100%

Production readiness: 100%

Novelty: 100%

Authors

Alessandro Pegoraro, Kavita Kumari, Hossein Fereidooni, Ahmad-Reza Sadeghi

Links

Abstract / PDF

Why It Matters For Business

Current off-the-shelf detectors miss most ChatGPT outputs while rarely mislabeling human text; companies cannot depend on these tools alone for content safety or compliance.

Who Should Care

Summary TLDR

The authors collect a large benchmark of human and ChatGPT responses and test many public detectors. Across tools (research models and online services) none reliably flag ChatGPT text: the best true‑positive rate (TPR) on their benchmark is below 50% while true‑negative rates (TNR) are typically high (~90%+). The paper warns detectors are brittle and encourages rigorous testing before relying on them.

Problem Statement

ChatGPT and similar LLMs are widely used and easily misused (plagiarism, misinformation, cheating). Many tools claim to detect AI-generated text but their real-world effectiveness on ChatGPT is unclear. The paper asks: how well do existing detectors and online services distinguish ChatGPT outputs from human text on a broad benchmark?

Main Contribution

Built a benchmark from ~131k human and ChatGPT responses (derived from Guo et al.) and reduced it to a roughly 10% evaluation subset across medicine, finance, and open Q&A

Evaluated many published detectors and public online tools (e.g., OpenAI Classifier, GPTZero, ZeroGPT, Hugging Face, Perplexity, Writefull, Copyleaks) on that benchmark
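The reduction to a roughly 10% subset could be done in several ways, and this summary does not say which one the authors used. As a minimal sketch of one common option, the code below draws a stratified sample so that every domain and label keeps the same share. The record keys `domain` and `label`, the function name, and the toy data are all assumptions made for illustration, not the paper's procedure.

```python
import random
from collections import defaultdict


def stratified_subset(records, fraction=0.10, seed=0):
    """Sample `fraction` of the records from each (domain, label) stratum.

    `records` is a list of dicts with the hypothetical keys "domain" and "label".
    """
    rng = random.Random(seed)  # fixed seed so the same subset can be regenerated
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["domain"], rec["label"])].append(rec)

    subset = []
    for key in sorted(strata):  # sorted for a deterministic draw order
        group = strata[key]
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        subset.extend(rng.sample(group, k))
    return subset


# Invented toy data: 100 records per domain and label, 600 in total.
toy_records = [
    {"id": i, "domain": domain, "label": label}
    for domain in ("medicine", "finance", "open_qa")
    for label in ("human", "chatgpt")
    for i in range(100)
]
toy_subset = stratified_subset(toy_records)
print(len(toy_subset))
```

Fixing the seed and recording the sampling rule is what lets someone else rebuild the same evaluation subset, which matters because the representativeness of a 10% sample depends on how it was drawn.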

Key Findings

No evaluated detector consistently detects ChatGPT-generated text.

Numbers: No detector exceeds a TPR of 47.3% in the paper's Table I.

Practical Use: Do not rely on current detectors alone to flag ChatGPT content; expect many false negatives.

Evidence Ref: Table I

Detectors tend to classify human text correctly but miss generated text.

Numbers: Many tools report TNR ≈ 90% or higher (examples up to 99.3%).

Practical Use: Detectors will rarely mislabel human text as AI, but they will often fail to catch AI outputs; add human review for risky cases.

Evidence Ref: Table I; Conclusion
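To make these rates concrete, the following minimal sketch converts a TPR or TNR into expected error counts. The batch size of 1,000 texts per class is an assumption chosen for illustration. The 47.3% and 99.3% figures are the Table I values quoted above; they come from different tools, so they are applied separately rather than combined into one detector.

```python
def expected_missed(n_ai_texts: int, tpr: float) -> int:
    """Expected number of AI-written texts the detector fails to flag (false negatives)."""
    return round(n_ai_texts * (1.0 - tpr))


def expected_false_alarms(n_human_texts: int, tnr: float) -> int:
    """Expected number of human-written texts wrongly flagged as AI (false positives)."""
    return round(n_human_texts * (1.0 - tnr))


# Hypothetical batch of 1,000 texts per class; rates are the Table I values quoted above.
missed = expected_missed(1000, tpr=0.473)
false_alarms = expected_false_alarms(1000, tnr=0.993)
print(missed, false_alarms)  # prints "527 7"
```

Under these assumptions, even the best-scoring detector would let more than half of the generated texts through, while a high-TNR tool would raise only a handful of false alarms on human text.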

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Best observed TPR | 47.3% | - | - | Paper's benchmark (Table I) | Table I lists Guo et al. detector TPR = 47.3% | Table I
Typical TNR | ≈90–99% | - | - | Paper's benchmark (Table I) | Multiple tools in Table I show TNR values ≥90%, e.g., Writefull 99.3%, Perplexity 98.3% | Table I

What To Try In 7 Days

Run your content through several detectors from Table I and record TPR/TNR on a small labeled sample

Add mandatory human review or spot audits for high-risk content instead of fully trusting detectors

Measure detector performance by text length and domain; prioritize detectors that accept your input sizes
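The measurement steps above can be sketched as a small harness that takes your own labeled sample, records each detector's verdict, and reports TPR and TNR overall, per domain, and per length bucket. This is a sketch under stated assumptions: the `Sample` fields, the 50-word cutoff, and the toy data are invented for illustration, and the detector verdicts are assumed to have been collected beforehand (no detector API is called here).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    is_ai: bool    # ground-truth label from your own labeled sample
    flagged: bool  # detector verdict (True = detector says AI-generated)
    domain: str    # e.g. "medicine", "finance", "open_qa"


def length_bucket(text: str, short_max_words: int = 50) -> str:
    """Bucket a text by word count; the 50-word cutoff is an arbitrary assumption."""
    return "short" if len(text.split()) <= short_max_words else "long"


def rates(samples):
    """Return (TPR, TNR) for a list of samples; None when a class is absent."""
    tp = sum(s.is_ai and s.flagged for s in samples)
    fn = sum(s.is_ai and not s.flagged for s in samples)
    tn = sum(not s.is_ai and not s.flagged for s in samples)
    fp = sum(not s.is_ai and s.flagged for s in samples)
    tpr = tp / (tp + fn) if tp + fn else None
    tnr = tn / (tn + fp) if tn + fp else None
    return tpr, tnr


def rates_by(samples, key):
    """Group samples with `key` and compute (TPR, TNR) per group."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    return {name: rates(group) for name, group in groups.items()}


# Invented toy sample: two short finance texts, two long medicine texts.
toy = [
    Sample("A brief human note.", is_ai=False, flagged=False, domain="finance"),
    Sample("A brief generated note.", is_ai=True, flagged=False, domain="finance"),
    Sample("Generated answer " * 40, is_ai=True, flagged=True, domain="medicine"),
    Sample("Human answer " * 40, is_ai=False, flagged=False, domain="medicine"),
]
print(rates(toy))
print(rates_by(toy, lambda s: s.domain))
print(rates_by(toy, lambda s: length_bucket(s.text)))
```

Running one such report per detector makes it easy to see whether a tool's miss rate is concentrated in short texts or in a particular domain before deciding where human review is needed.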

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Benchmark built from a subset (~10%) of a larger dataset; representativeness depends on selection method

The paper evaluates publicly available tools at a snapshot in time; tool behavior may change

When Not To Use

Do not assume detector performance generalizes to your domain without local testing

Avoid using these detectors as the sole evidence in high-stakes decisions (legal, medical, academic)

Failure Modes

High false negatives: many ChatGPT outputs go undetected

Length and language limits: short texts or unsupported languages reduce detector utility

Core Entities

Models

ChatGPT, GPT-2, GPT-3, GPT-3.5, GPT-4, Grover, RoBERTa, DistilBERT

Metrics

True Positive Rate (TPR), True Negative Rate (TNR), Perplexity (PPL)

Datasets

Benchmark derived from Guo et al. (131,512 samples; reduced to ~10% for evaluation)

Guo et al. dataset (used to generate prompts/responses)

Benchmarks

This paper's ChatGPT vs human benchmark (multi-domain, ~10% subset used for eval)