ChatGPT is weak at standard supervised IE but surprisingly strong at open extraction, explains itself well, yet is overconfident

Overview

Decision Snapshot: Needs Validation

The study gives concrete, manually verified results across 14 datasets, so conclusions about OpenIE strength, explainability, and miscalibration are well supported.

Citations: 59

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 40%

Authors

Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, Shikun Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper evaluates ChatGPT on seven fine-grained information extraction tasks (14 datasets) across four dimensions: performance, explainability, calibration, and faithfulness. Key takeaways: ChatGPT underperforms supervised baselines in the Standard-IE (labelled) setting, but human judges rate its outputs highly in an OpenIE (no label set) setting (e.g., 84–97% reasonable on ET/NER/RC). ChatGPT gives high-quality explanations and faithful alignment to input text (>95% faithfulness), but it is poorly calibrated and often overconfident (high ECEs and confidence gaps). The authors release annotated test sets and code.

Problem Statement

Can a conversational LLM (ChatGPT) reliably perform common information extraction tasks, explain its answers, provide calibrated uncertainties, and stay faithful to source text — and how does it compare to supervised baselines?

Main Contribution

Systematic evaluation of ChatGPT on 7 IE tasks (14 datasets) along four dimensions: performance, explainability, calibration, and faithfulness.

Comparison with BERT/RoBERTa and SOTA models under a Standard-IE (label set) and an OpenIE (no labels) setup.

Key Findings

ChatGPT underperforms supervised baselines on Standard-IE tasks.

Numbers: Standard-IE full-test Micro-F1: ChatGPT often << SOTA (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56).

Practical Use: Do not drop a supervised IE model for ChatGPT if you need state-of-the-art extraction accuracy on labelled tasks.

Evidence Ref: Table 2 (Standard-IE comparisons)

In the OpenIE setting, human experts rate ChatGPT's outputs as reasonable at high rates.

Numbers: OpenIE judged reasonable: ET/NER/RC ~97.2%/93.3%/84.3% (sampled tests).

Practical Use: Use ChatGPT as a zero-shot extractor or idea generator when no label set exists and human review is feasible.

Evidence Ref: Table 3 (OpenIE accuracy by human check)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Standard-IE performance (representative)	ChatGPT substantially below SOTA on many tasks (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56)	SOTA per task	Large negative gap on complex tasks	Full test sets (Table 2)	Table 2 shows ChatGPT much lower than supervised baselines on RE and EE	Table 2
OpenIE judged reasonable rate	ET/NER/RC judged reasonable: 97.2% / 93.3% / 84.3%	N/A (human judgement)	—	Sampled test sets (≈200 per dataset)	Table 3 reports human expert judgement on OpenIE outputs	Table 3
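
For readers unfamiliar with the Micro-F1 metric named in the Standard-IE row, the following is a minimal sketch of how it is commonly computed: true positives, false positives, and false negatives are pooled over all examples before precision and recall are taken. The function name `micro_f1`, the set-of-tuples input format, and the exact-match criterion are assumptions made for illustration; this summary does not describe the paper's own scoring script.

```python
def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over per-example sets of extracted items.

    Assumption: an item counts as correct only on exact match
    (e.g. the same (span, label) tuple appears in gold and prediction).
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # predicted and correct
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, not taken from the paper.
gold = [{("Alice", "PER")}, {("Acme", "ORG"), ("Paris", "LOC")}]
pred = [{("Alice", "PER")}, {("Acme", "ORG"), ("Paris", "ORG")}]
score = micro_f1(gold, pred)
```

Because counts are pooled, datasets with many items per example weigh more heavily than in a per-example average, which is one reason scores on dense tasks such as event extraction can look especially low.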

What To Try In 7 Days

Run ChatGPT in OpenIE mode on unlabeled corpora to generate candidate extractions for human review.

Use ChatGPT top-5 outputs as a short-listed candidate set to accelerate annotators and reduce labeling time.

Add a lightweight calibrator or thresholding step before accepting predictions for decisions that need reliable confidences.
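
The calibration check and thresholding step suggested above could be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the equal-width 10-bin scheme, and the dictionary layout of a prediction are all assumptions, and the summary does not state how the paper bins confidences when computing ECE.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Assumption: equal-width binning, one common choice among several.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


def accept_above(predictions, threshold):
    """Keep only predictions whose confidence meets the threshold.

    Assumption: each prediction is a dict with a 'confidence' key in [0, 1].
    """
    return [p for p in predictions if p["confidence"] >= threshold]


# Hypothetical toy data: four confident answers, only one of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
```

If the model reports confidence on a 1-100 scale, divide by 100 before calling these functions. A large ECE on a held-out, human-checked sample is a signal that a fixed threshold alone will not be reliable, since an overconfident model assigns high confidence to wrong outputs as well.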

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/pkuserc/ChatGPT_for_IE

Data URLs

https://github.com/pkuserc/ChatGPT_for_IE

Risks & Boundaries

Limitations

ChatGPT is a closed model; internal training details are unknown and may affect reproducibility.

Human evaluation was sampled (~200 per dataset) and subjective, especially for OpenIE reasonableness.

When Not To Use

When you require SOTA supervised extraction accuracy on labelled datasets.

When you need well-calibrated probability estimates for automated decisions.

Failure Modes

Overconfidence: high predicted confidences even on incorrect outputs (poor calibration).

Low recall in top-1 predictions on complex tasks, causing missed extractions unless top-k is used.
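
To see how much the top-1 versus top-k gap matters on your own data, a top-k recall check along these lines may help. It is a sketch under the assumption that each example has a single gold label and that the model returns a ranked list of candidate labels; the function name and input format are hypothetical, and the paper's exact top-k definition is not given in this summary.

```python
def top_k_recall(gold_labels, ranked_predictions, k):
    """Fraction of examples whose gold label is among the first k predictions.

    Assumption: one gold label per example, predictions ordered best-first.
    """
    if len(gold_labels) != len(ranked_predictions):
        raise ValueError("inputs must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        1 for gold, preds in zip(gold_labels, ranked_predictions)
        if gold in preds[:k]
    )
    return hits / len(gold_labels)


# Hypothetical toy data, not taken from the paper.
gold = ["PER", "ORG", "LOC"]
ranked = [["PER", "ORG"], ["LOC", "ORG"], ["PER", "ORG"]]
recall_at_1 = top_k_recall(gold, ranked, k=1)
recall_at_2 = top_k_recall(gold, ranked, k=2)
```

Comparing the value at k=1 with the value at a larger k gives a rough estimate of how many extractions a top-1-only pipeline would miss but a reviewer working from a short list could still recover.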

Core Entities

Models

ChatGPT, BERT, RoBERTa, SOTA (task-specific methods cited)

Metrics

Micro-F1, F1 (trigger/argument), Accuracy, Top-k recall, Expected Calibration Error (ECE), Predicted confidence (1-100)

Datasets

BBN, OntoNotes 5.0, CoNLL2003, TACRED, SemEval2010, ACE05-R, SciERC, ACE05-E, ACE05-E+
