Overview
The paper compiles a benchmark and evaluates existing detectors on it, but it does not release code or a fully public dataset. On its test set, detector effectiveness is low.
Citations: 36
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/3
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 100%
Production readiness: 100%
Novelty: 100%
Why It Matters For Business
Current off-the-shelf detectors miss most ChatGPT outputs while rarely mislabeling human text; companies cannot depend on these tools alone for content safety or compliance.
Who Should Care
Summary TLDR
The authors collect a large benchmark of human and ChatGPT responses and test many public detectors. Across research models and online services, no tool reliably flags ChatGPT text: the best true‑positive rate (TPR) on their benchmark is 47.3%, while true‑negative rates (TNR) are typically high (roughly 90–99%). The paper warns that detectors are brittle and encourages rigorous testing before relying on them.
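For readers less familiar with the two rates, the standard definitions are written out below. This assumes the usual convention in which a "positive" is a ChatGPT-generated text, so TP and FN count generated texts that were flagged and missed, and TN and FP count human texts that were passed and wrongly flagged. The summary does not state the paper's exact convention, so treat this as an assumption.

```latex
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}
```

Under these definitions, a TPR of 47.3% means that out of 1,000 ChatGPT responses roughly 473 would be flagged and 527 would pass as human-written.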
Problem Statement
ChatGPT and similar LLMs are widely used and easily misused (plagiarism, misinformation, cheating). Many tools claim to detect AI-generated text but their real-world effectiveness on ChatGPT is unclear. The paper asks: how well do existing detectors and online services distinguish ChatGPT outputs from human text on a broad benchmark?
Main Contribution
Built a benchmark from ~131k human and ChatGPT responses (derived from Guo et al.) and reduced it to a roughly 10% evaluation subset across medicine, finance, and open Q&A
Evaluated many published detectors and public online tools (e.g., OpenAI Classifier, GPTZero, ZeroGPT, Hugging Face, Perplexity, Writefull, Copyleaks) on that benchmark
Key Findings
No evaluated detector consistently detects ChatGPT-generated text.
Detectors tend to classify human text correctly but miss generated text.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best observed TPR | 47.3% | — | — | Paper's benchmark (Table I) | Table I lists Guo et al. detector TPR = 47.3% | Table I |
| Typical TNR | ≈90–99% | — | — | Paper's benchmark (Table I) | Multiple tools in Table I show TNR values ≥90%, e.g., Writefull 99.3%, Perplexity 98.3% | Table I |
What To Try In 7 Days
Run your content through several detectors from Table I and record TPR/TNR on a small labeled sample
Add mandatory human review or spot audits for high-risk content instead of fully trusting detectors
Measure detector performance by text length and domain; prioritize detectors that accept your input sizes
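The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the `detector` callable, the sample fields (`text`, `domain`, `is_generated`), and the word-count thresholds (50 and 200) are all hypothetical placeholders to replace with your own tool wrapper and data.

```python
from collections import defaultdict


def rates(rows):
    """Compute (TPR, TNR) from (is_generated, flagged) pairs.

    Returns None for a rate whose denominator is zero.
    """
    tp = fn = tn = fp = 0
    for is_generated, flagged in rows:
        if is_generated:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return tpr, tnr


def length_bucket(text, edges=(50, 200)):
    """Bucket a text by word count; the edges are arbitrary assumptions."""
    n_words = len(text.split())
    if n_words < edges[0]:
        return "short"
    if n_words < edges[1]:
        return "medium"
    return "long"


def evaluate(samples, detector):
    """Group labeled samples by (domain, length bucket) and score each group.

    `detector` is any callable that takes a text and returns True when it
    flags the text as AI-generated (a hypothetical wrapper around a tool).
    """
    groups = defaultdict(list)
    for sample in samples:
        flagged = detector(sample["text"])
        key = (sample["domain"], length_bucket(sample["text"]))
        groups[key].append((sample["is_generated"], flagged))
    return {key: rates(rows) for key, rows in groups.items()}
```

Running `evaluate` once per detector on the same labeled sample gives a per-domain, per-length table of TPR and TNR that can be compared across tools. Groups with only a handful of samples will give noisy rates, so it helps to report the group size alongside each figure.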
Risks & Boundaries
Limitations
Benchmark built from a subset (~10%) of a larger dataset; representativeness depends on how that subset was selected
The paper evaluates publicly available tools at a snapshot in time; tool behavior may change
When Not To Use
Do not assume detector performance generalizes to your domain without local testing
Avoid using these detectors as the sole evidence in high-stakes decisions (legal, medical, academic)
Failure Modes
High false negatives: many ChatGPT outputs go undetected
Length and language limits: short texts or unsupported languages reduce detector utility

