Systematic test shows current detectors fail to reliably spot ChatGPT text

Overview

Decision Snapshot: Needs Validation

The paper compiles a benchmark and an evaluation but releases no code and no fully public dataset; the detectors it tests show low effectiveness on its test set.

Citations: 36

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 100%

Production readiness: 100%

Novelty: 100%

Authors

Alessandro Pegoraro, Kavita Kumari, Hossein Fereidooni, Ahmad-Reza Sadeghi

Why It Matters For Business

Current off-the-shelf detectors miss most ChatGPT outputs while rarely mislabeling human text; companies cannot depend on these tools alone for content safety or compliance.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist

Summary TLDR

The authors collect a large benchmark of human and ChatGPT responses and test many public detectors. Across tools (research models and online services) none reliably flag ChatGPT text: the best true‑positive rate (TPR) on their benchmark is below 50% while true‑negative rates (TNR) are typically high (~90%+). The paper warns detectors are brittle and encourages rigorous testing before relying on them.

Problem Statement

ChatGPT and similar LLMs are widely used and easily misused (plagiarism, misinformation, cheating). Many tools claim to detect AI-generated text but their real-world effectiveness on ChatGPT is unclear. The paper asks: how well do existing detectors and online services distinguish ChatGPT outputs from human text on a broad benchmark?

Main Contribution

Built a benchmark from ~131k human and ChatGPT responses (derived from Guo et al.) and reduced it to a roughly 10% evaluation subset across medicine, finance, and open Q&A

Evaluated many published detectors and public online tools (e.g., OpenAI Classifier, GPTZero, ZeroGPT, Hugging Face, Perplexity, Writefull, Copyleaks) on that benchmark
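This summary does not say how the roughly 10% subset was selected. As an illustration only, the sketch below shows one common way to draw such a subset: a stratified sample that keeps about 10% of every domain and source group. The record fields (`domain`, `source`, `text`), the function name, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_subsample(records, fraction=0.10, seed=0):
    """Keep roughly `fraction` of the records from every (domain, source) group."""
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["domain"], rec["source"])].append(rec)

    subset = []
    for key in sorted(groups):
        members = groups[key]
        k = max(1, round(len(members) * fraction))  # at least one per group
        subset.extend(rng.sample(members, k))
    return subset


# Hypothetical toy corpus: 3 domains x 2 sources x 50 texts each.
records = [
    {"domain": d, "source": s, "text": f"{d}-{s}-{i}"}
    for d in ("medicine", "finance", "open_qa")
    for s in ("human", "chatgpt")
    for i in range(50)
]
subset = stratified_subsample(records)
print(len(subset), "of", len(records), "records kept")
```

Stratifying keeps each domain and each class represented in the same proportion as the full set, which matters when the subset is the only data the detectors are scored on.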

Key Findings

No evaluated detector consistently detects ChatGPT-generated text.

Numbers: Best observed TPR is 47.3% (Table I)

Practical Use: Do not rely on current detectors alone to flag ChatGPT content; expect many false negatives.

Evidence Ref: Table I

Detectors tend to classify human text correctly but miss generated text.

Numbers: Many tools report TNR of about 90% or higher (examples up to 99.3%)

Practical Use: Detectors will rarely mislabel human text as AI, but they will often fail to catch AI outputs. Add human review for risky cases.

Evidence Ref: Table I; Conclusion
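To make these rates concrete, the following sketch turns a TPR and a TNR into expected counts for a screened corpus. The corpus mix (1,000 ChatGPT texts and 1,000 human texts) is a hypothetical assumption, and pairing the best reported TPR (47.3%) with a 90% TNR is illustrative rather than the result of any single tool in Table I.

```python
def expected_outcomes(n_ai, n_human, tpr, tnr):
    """Expected counts when a detector with the given rates screens a corpus."""
    ai_caught = round(n_ai * tpr)                # true positives
    human_flagged = round(n_human * (1 - tnr))   # false positives
    return {
        "ai_caught": ai_caught,
        "ai_missed": n_ai - ai_caught,           # false negatives
        "human_flagged": human_flagged,
        "human_cleared": n_human - human_flagged,
    }


# Hypothetical corpus mix; rates taken from the figures quoted above.
outcome = expected_outcomes(n_ai=1000, n_human=1000, tpr=0.473, tnr=0.90)
print(outcome)
```

Under these assumptions about 527 of the 1,000 ChatGPT texts would pass unflagged, while about 100 of the 1,000 human texts would be flagged.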

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Best observed TPR	47.3%	—	—	Paper's benchmark (Table I)	Table I lists Guo et al. detector TPR = 47.3%	Table I
Typical TNR	≈90–99%	—	—	Paper's benchmark (Table I)	Multiple tools in Table I show TNR values ≥90%, e.g., Writefull 99.3%, Perplexity 98.3%	Table I

What To Try In 7 Days

Run your content through several detectors from Table I and record TPR/TNR on a small labeled sample

Add mandatory human review or spot audits for high-risk content instead of fully trusting detectors

Measure detector performance by text length and domain; prioritize detectors that accept your input sizes
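A minimal sketch of the measurement step, assuming you have already collected each detector's yes/no verdict for a small hand-labeled sample. The detector names and verdicts below are made-up placeholders; no real detector API is called.

```python
def tpr_tnr(labels, predictions):
    """Return (TPR, TNR). In both lists True means 'AI-generated'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Hypothetical labeled sample: 4 AI-written texts followed by 4 human texts.
labels = [True, True, True, True, False, False, False, False]

# Hypothetical verdicts recorded by hand from two detectors.
detector_outputs = {
    "detector_a": [True, False, False, True, False, False, False, False],
    "detector_b": [True, True, True, False, True, False, False, True],
}

for name, preds in detector_outputs.items():
    tpr, tnr = tpr_tnr(labels, preds)
    print(f"{name}: TPR={tpr:.2f} TNR={tnr:.2f}")
```

Reporting both rates side by side makes the trade-off visible: a detector that flags more text will raise TPR but usually lowers TNR.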

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Benchmark built from a subset (~10%) of a larger dataset; representativeness depends on selection method

The paper evaluates publicly available tools at a snapshot in time; tool behavior may change

When Not To Use

Do not assume detector performance generalizes to your domain without local testing

Avoid using these detectors as the sole evidence in high-stakes decisions (legal, medical, academic)
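Local testing usually means a small labeled sample, so the measured rate is itself uncertain. As a sketch under that assumption, the Wilson score interval below (a standard statistical tool, not something the paper uses) gives a 95% range for a detection rate. The sample figures are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical local test: 19 of 40 AI-written texts were flagged.
low, high = wilson_interval(19, 40)
print(f"Observed TPR {19 / 40:.3f}, 95% interval {low:.3f} to {high:.3f}")
```

A wide interval is a signal to label more samples before drawing conclusions about a detector in your domain.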

Failure Modes

High false negatives: many ChatGPT outputs go undetected

Length and language limits: short texts or unsupported languages reduce detector utility
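One way to check whether length affects a detector on your own data is to break the detection rate down by word count. The sketch below assumes a list of `(text, is_ai, flagged)` tuples; the bucket edges and the toy samples are hypothetical.

```python
from collections import defaultdict


def tpr_by_length(samples, edges=(50, 200)):
    """samples: iterable of (text, is_ai, flagged).

    Returns {bucket_label: (tpr, number_of_ai_texts)} over AI-written texts only.
    """
    def bucket(n_words):
        lower = 0
        for edge in edges:
            if n_words < edge:
                return f"{lower}-{edge - 1} words"
            lower = edge
        return f"{lower}+ words"

    hits = defaultdict(int)
    totals = defaultdict(int)
    for text, is_ai, flagged in samples:
        if not is_ai:
            continue  # TPR only concerns AI-written texts
        label = bucket(len(text.split()))
        totals[label] += 1
        hits[label] += int(flagged)
    return {label: (hits[label] / totals[label], totals[label]) for label in totals}


# Hypothetical samples: repeated filler words stand in for real texts.
samples = [
    ("word " * 10, True, False),
    ("word " * 20, True, False),
    ("word " * 100, True, True),
    ("word " * 120, True, False),
    ("word " * 300, True, True),
    ("word " * 30, False, False),  # human text, ignored for TPR
]
breakdown = tpr_by_length(samples)
for label, (tpr, n) in breakdown.items():
    print(f"{label}: TPR={tpr:.2f} (n={n})")
```

The same grouping works for domain or language by swapping the bucket function for a lookup on a metadata field.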

Core Entities

Models

ChatGPT, GPT-2, GPT-3, GPT-3.5, GPT-4, Grover, RoBERTa, DistilBERT

Metrics

True Positive Rate (TPR), True Negative Rate (TNR), Perplexity (PPL)

Datasets

Benchmark derived from Guo et al. (131,512 samples; reduced to ~10% for evaluation)

Guo et al. dataset (used to generate prompts/responses)

Benchmarks

This paper's ChatGPT vs human benchmark (multi-domain, ~10% subset used for eval)
