MGTBench: a modular benchmark that measures how well detectors spot and attribute text from modern LLMs and how brittle they are to attacks

Overview

Decision Snapshot: Ready For Pilot

The benchmark is useful in practice: it provides code, covers many detectors and LLMs, and shows realistic limits (short text, adversarial attacks). Results are strong for fine-tuned detectors but fragile under simple attacks.

Citations: 31

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/8

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated detection helps flag AI-written content that affects trust, compliance, or fraud; MGTBench identifies which detectors work, how much labelled data they need, and where they fail under attacks.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The authors build MGTBench, a modular benchmark and codebase to evaluate methods that detect or attribute machine-generated text (MGT). They compare 8 metric-based signals (log-likelihood, rank, entropy, DetectGPT, etc.) and 5 model-based detectors (RoBERTa/BERT fine-tuned detectors, GPTZero, ConDA) on three datasets (Essay, WP, Reuters) and six LLMs (ChatGPT-turbo, Claude, ChatGLM, Dolly, GPT4All, StableLM). Main takeaways: a fine-tuned LM detector usually gives the highest F1, log-likelihood-style metrics transfer well across LLMs, 200 words is typically enough for near-peak performance, many methods need only a few metric calibration samples, and all detectors are vulnerable to paraphrase

Problem Statement

There is no single, comparable evaluation of machine-generated-text detectors against modern LLMs. Prior work reports mixed settings, models, and datasets, so it is unclear which detectors work best, how well they transfer, and how robust they are to simple attacks.

Main Contribution

MGTBench: a modular benchmarking framework and codebase for detection and attribution of machine-generated text

Large empirical study: 13 detection methods, 6 LLMs, 3 datasets; head-to-head F1 and runtime comparisons

Key Findings

Fine-tuned LM Detector gives the highest detection accuracy across datasets

Numbers: F1=0.993 (Essay, human vs ChatGPT-turbo)

Practical Use: If you can fine-tune a classifier on representative MGT and human text, expect top detection performance; deploy this when labeled examples are available.

Evidence Ref: Table 2

A simple metric (log-likelihood) transfers well across LLMs

Numbers: Log-Likelihood trained on GPT4All → F1=0.983 on ChatGLM (Essay)

Practical Use: For rapid, low-cost detection across unknown LLMs, calibrate a log-likelihood threshold on one generator and reuse it on others.

Evidence Ref: Abstract / Transfer section
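The calibrate-then-reuse idea above can be sketched in a few lines. This is a minimal illustration, not the MGTBench implementation: it assumes each text already has an average log-likelihood score from some scoring model (not shown), that machine text tends to score higher than human text, and that labels are 1 for machine and 0 for human. The function names and the toy scores are hypothetical and are not results from the paper.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the machine class when score >= threshold predicts 'machine'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try every observed score as a threshold and keep the one with the best F1."""
    candidates = sorted(set(scores))
    return max(
        ((t, f1_at_threshold(scores, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Toy calibration set from one generator (illustrative numbers only).
calib_scores = [-3.2, -3.0, -2.9, -2.1, -1.9, -1.8]
calib_labels = [0, 0, 0, 1, 1, 1]
threshold, calib_f1 = calibrate_threshold(calib_scores, calib_labels)

# Reuse the same threshold on texts from a different generator.
other_scores = [-3.1, -2.8, -2.0, -1.7]
other_labels = [0, 0, 1, 1]
transfer_f1 = f1_at_threshold(other_scores, other_labels, threshold)
```

Comparing `calib_f1` with `transfer_f1` on your own data gives a quick read on how well a threshold tuned on one generator carries over to another.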

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LM Detector F1 (human vs ChatGPT-turbo)	0.993	—	—	Essay	LM Detector achieves F1=0.993 on Essay (Table 2).	Table 2
Log-Likelihood F1 (human vs ChatGPT-turbo)	0.968	—	—	Essay	Log-Likelihood reaches F1=0.968 on Essay (Table 2).	Table 2

What To Try In 7 Days

Run MGTBench on your corpus to baseline exposure to AI-written text using Log-Likelihood and LM Detector

Calibrate a log-likelihood threshold with 10–50 examples for quick cross-LLM coverage

Run a paraphrase and spacing test (easy to automate) to estimate detector fragility on your data
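The spacing test in the last item can be automated roughly as follows. This is a sketch under stated assumptions: the perturbation here (an extra space before commas) is one simple variant chosen for illustration and may not match the exact attack used in the paper, and `detector` stands in for any callable that returns True when it flags a text as machine-generated. All names are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def spacing_attack(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Insert a space before each comma with probability `rate` (illustrative variant)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "," and rng.random() < rate:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def flag_rate(detector: Callable[[str], bool], texts: List[str]) -> float:
    """Fraction of texts the detector flags as machine-generated."""
    return sum(1 for t in texts if detector(t)) / len(texts)


def fragility(
    detector: Callable[[str], bool],
    machine_texts: List[str],
    attack: Callable[[str], str],
) -> Tuple[float, float]:
    """Flag rate on known machine texts before and after the attack."""
    before = flag_rate(detector, machine_texts)
    after = flag_rate(detector, [attack(t) for t in machine_texts])
    return before, after
```

A large gap between the two returned rates on known machine-generated samples suggests the detector depends on surface formatting. A paraphrase test can reuse `fragility` by passing a paraphrasing function as `attack`.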

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/xinleihe/MGTBench

Data URLs

https://github.com/xinleihe/MGTBench

Risks & Boundaries

Limitations

Evaluated on 6 LLMs and 3 English datasets only; not exhaustive across domains or languages

Metric-based and model-based methods studied; new approaches like prompt-based detection are not included

When Not To Use

For very short texts (below ~50 words), where detectors lose reliability

When adversaries can paraphrase or perturb text without constraint
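To see where the short-text boundary falls on your own data, one option is to truncate each text to a fixed word budget and re-score it. The sketch below assumes a hypothetical `detector` callable that returns True for machine-generated text and labels of 1 (machine) or 0 (human); the word budgets are illustrative defaults, not values prescribed by the paper.

```python
from typing import Callable, Dict, Sequence


def truncate_words(text: str, n_words: int) -> str:
    """Keep only the first n_words whitespace-separated tokens."""
    return " ".join(text.split()[:n_words])


def accuracy_by_length(
    detector: Callable[[str], bool],
    texts: Sequence[str],
    labels: Sequence[int],
    lengths: Sequence[int] = (25, 50, 100, 200),
) -> Dict[int, float]:
    """Detector accuracy when every text is cut to each word budget in turn."""
    results = {}
    for n in lengths:
        preds = [detector(truncate_words(t, n)) for t in texts]
        correct = sum(1 for p, y in zip(preds, labels) if bool(p) == bool(y))
        results[n] = correct / len(texts)
    return results
```

If accuracy drops sharply below some budget, texts shorter than that are candidates for abstaining or for human review instead of an automated verdict.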

Failure Modes

Model-based detectors can overfit dataset-specific cues and not generalize to new topics

Metric-based detectors can misattribute among multiple LLMs and rely on length or formatting artifacts

Core Entities

Models

ChatGPT-turbo, Claude, ChatGLM, Dolly, GPT4All, StableLM, GPT2-medium, RoBERTa, BERT

Metrics

Log-Likelihood, Rank, Log-Rank, Entropy, GLTR, DetectGPT, LRR, NPR, F1-score, AUC

Datasets

Essay, WP, Reuters

Benchmarks

MGTBench

