MGTBench: a modular benchmark that measures how well detectors spot and attribute text from modern LLMs and how brittle they are to attacks

March 26, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is useful in practice: it provides code, covers many detectors and LLMs, and shows realistic limits (short text, adversarial attacks). Results are strong for fine-tuned detectors but fragile under simple attacks.

Citations: 31

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated detection helps flag AI-written content that affects trust, compliance, or fraud; MGTBench identifies which detectors work, how much labelled data they need, and where they fail under attacks.

Who Should Care

Summary TLDR

The authors build MGTBench, a modular benchmark and codebase for evaluating methods that detect or attribute machine-generated text (MGT). They compare 8 metric-based signals (log-likelihood, rank, entropy, DetectGPT, etc.) and 5 model-based detectors (fine-tuned RoBERTa/BERT detectors, GPTZero, ConDA) on three datasets (Essay, WP, Reuters) and six LLMs (ChatGPT-turbo, Claude, ChatGLM, Dolly, GPT4All, StableLM). Main takeaways: a fine-tuned LM detector usually gives the highest F1; log-likelihood-style metrics transfer well across LLMs; about 200 words is typically enough for near-peak performance; many metric-based methods need only a few calibration samples; and all detectors are vulnerable to paraphrase attacks.
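To make the metric-based signals concrete, here is a minimal sketch of how log-likelihood, rank, log-rank, and entropy are commonly defined over a token sequence. It assumes the per-position next-token distributions are already available from some scoring LM; the function name `metric_signals` and the dictionary-based input format are hypothetical and are not taken from the MGTBench codebase.

```python
import math
from typing import Dict, Sequence


def metric_signals(
    dists: Sequence[Dict[str, float]], tokens: Sequence[str]
) -> Dict[str, float]:
    """Average log-likelihood, rank, log-rank and entropy over a sequence.

    dists[i] is an assumed next-token probability distribution at position i
    (as a scoring LM would produce); tokens[i] is the token observed there.
    """
    if len(dists) != len(tokens) or not tokens:
        raise ValueError("need exactly one distribution per observed token")

    ll = rank_sum = log_rank_sum = entropy_sum = 0.0
    for dist, tok in zip(dists, tokens):
        p = dist[tok]
        ll += math.log(p)
        # Rank 1 means the observed token was the model's top choice.
        rank = 1 + sum(1 for q in dist.values() if q > p)
        rank_sum += rank
        log_rank_sum += math.log(rank)
        # Shannon entropy of the predicted distribution at this position.
        entropy_sum -= sum(q * math.log(q) for q in dist.values() if q > 0)

    n = len(tokens)
    return {
        "log_likelihood": ll / n,
        "rank": rank_sum / n,
        "log_rank": log_rank_sum / n,
        "entropy": entropy_sum / n,
    }


# Toy distributions standing in for a real scoring model's output.
toy_dists = [{"a": 0.5, "b": 0.25, "c": 0.25}, {"a": 0.1, "b": 0.9}]
signals = metric_signals(toy_dists, ["a", "a"])
```

In practice the distributions would come from a scoring model's softmax output; each signal is a single number per text, which is why these detectors need so little calibration data.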

Problem Statement

There is no single, comparable evaluation of machine-generated-text detectors against modern LLMs. Prior work reports mixed settings, models, and datasets, so it is unclear which detectors work best, how well they transfer, and how robust they are to simple attacks.

Main Contribution

MGTBench: a modular benchmarking framework and codebase for detection and attribution of machine-generated text

Large empirical study: 13 detection methods, 6 LLMs, 3 datasets; head-to-head F1 and runtime comparisons

Key Findings

The fine-tuned LM Detector usually gives the highest detection F1 across datasets

Numbers: F1=0.993 (Essay, human vs ChatGPT-turbo)

Practical Use: If you can fine-tune a classifier on representative MGT and human text, expect top detection performance; deploy this when labeled examples are available.

Evidence Ref: Table 2

A simple metric (log-likelihood) transfers well across LLMs

Numbers: Log-Likelihood trained on GPT4All → F1=0.983 on ChatGLM (Essay)

Practical Use: For rapid, low-cost detection across unknown LLMs, calibrate a log-likelihood threshold on one generator and reuse it on others.

Evidence Ref: Abstract / Transfer section
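The calibrate-once, reuse-elsewhere idea can be sketched as follows. This is an illustrative F1-maximising cut-off over precomputed log-likelihood scores, not a description of how MGTBench fits its metric-based detectors. It assumes machine-generated text scores higher (is more likely under the scoring model) than human text; the function names and toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = machine-generated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def apply_threshold(scores: Sequence[float], threshold: float) -> List[int]:
    """Predict machine-generated (1) when the score is at or above the cut-off."""
    return [1 if s >= threshold else 0 for s in scores]


def calibrate_threshold(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return the (threshold, F1) pair that maximises F1 on the calibration set."""
    best_t, best_f1 = min(scores), -1.0
    for t in sorted(set(scores)):
        f1 = f1_score(labels, apply_threshold(scores, t))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy calibration scores from one generator (values are made up).
calib_scores = [-2.1, -2.3, -2.0, -3.5, -3.1, -2.9]
calib_labels = [1, 1, 1, 0, 0, 0]
threshold, calib_f1 = calibrate_threshold(calib_scores, calib_labels)

# Reuse the same cut-off on scores from a different generator.
other_preds = apply_threshold([-2.2, -3.0], threshold)
```

Measuring F1 on a small labelled sample from the second generator is the quickest way to check whether the reused threshold actually transfers on your data.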

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LM Detector F1 (human vs ChatGPT-turbo) | 0.993 | - | - | Essay | LM Detector achieves F1=0.993 on Essay (Table 2). | Table 2
Log-Likelihood F1 (human vs ChatGPT-turbo) | 0.968 | - | - | Essay | Log-Likelihood reaches F1=0.968 on Essay (Table 2). | Table 2

What To Try In 7 Days

Run MGTBench on your corpus to baseline exposure to AI-written text using Log-Likelihood and LM Detector

Calibrate a log-likelihood threshold with 10–50 examples for quick cross-LLM coverage

Run a paraphrase and spacing test (easy to automate) to estimate detector fragility on your data
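The spacing test in the last item is easy to automate. The sketch below assumes a simple form of the perturbation (an extra space inserted before some commas) and measures how many previously flagged texts escape detection afterwards. `spacing_attack`, `flip_rate`, and the toy detector are hypothetical names for illustration; a real run would plug in whichever detector you are evaluating.

```python
import random
from typing import Callable, Sequence


def spacing_attack(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Insert an extra space before a random subset of commas."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "," and rng.random() < rate:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def flip_rate(
    texts: Sequence[str],
    detector: Callable[[str], bool],
    attack: Callable[[str], str],
) -> float:
    """Fraction of texts flagged before the attack that are no longer flagged after it."""
    flagged = [t for t in texts if detector(t)]
    if not flagged:
        return 0.0
    flipped = sum(1 for t in flagged if not detector(attack(t)))
    return flipped / len(flagged)


def toy_detector(text: str) -> bool:
    """Deliberately brittle stand-in: flags text only if comma spacing is clean."""
    return " ," not in text


samples = ["First, we test.", "Then, we measure, and report."]
rate = flip_rate(samples, toy_detector, lambda t: spacing_attack(t, rate=1.0))
```

A high flip rate on text you know to be machine-generated is a direct estimate of how fragile the detector is on your corpus.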

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated on 6 LLMs and 3 English datasets only; not exhaustive across domains or languages

Only metric-based and model-based methods are studied; newer approaches such as prompt-based detection are not included

When Not To Use

For very short texts (below ~50 words), where detectors lose reliability

When adversaries can paraphrase or perturb text without constraint
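One way to locate the short-text boundary on your own data is to truncate each sample to a fixed word budget and watch how detector accuracy changes. This is a rough sketch under stated assumptions: words are split on whitespace, the detector is any callable returning True for machine-generated text, and the helper names are hypothetical.

```python
from typing import Callable, Dict, Sequence


def truncate_words(text: str, n: int) -> str:
    """Keep only the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


def accuracy_by_length(
    texts: Sequence[str],
    labels: Sequence[int],
    detector: Callable[[str], bool],
    lengths: Sequence[int],
) -> Dict[int, float]:
    """Detector accuracy when every text is cut to each word budget in `lengths`."""
    results = {}
    for n in lengths:
        preds = [1 if detector(truncate_words(t, n)) else 0 for t in texts]
        correct = sum(1 for p, y in zip(preds, labels) if p == y)
        results[n] = correct / len(texts)
    return results


def keyword_detector(text: str) -> bool:
    """Toy stand-in whose evidence only appears late in the text."""
    return "robot" in text


texts = ["a b c robot d", "x y z w v"]
labels = [1, 0]  # 1 = machine-generated, 0 = human
curve = accuracy_by_length(texts, labels, keyword_detector, lengths=[2, 5])
```

Plotting the resulting curve for budgets such as 25, 50, 100, and 200 words shows where reliability drops off for a given detector.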

Failure Modes

Model-based detectors can overfit dataset-specific cues and not generalize to new topics

Metric-based detectors can misattribute among multiple LLMs and rely on length or formatting artifacts

Core Entities

Models

ChatGPT-turbo, Claude, ChatGLM, Dolly, GPT4All, StableLM, GPT2-medium, RoBERTa, BERT

Metrics

Log-Likelihood, Rank, Log-Rank, Entropy, GLTR, DetectGPT, LRR, NPR, F1-score, AUC

Datasets

Essay, WP, Reuters

Benchmarks

MGTBench