ChatGPT is weak at standard supervised IE but surprisingly strong at open extraction, explains itself well, yet is overconfident

April 23, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The study gives concrete, manually verified results across 14 datasets, so conclusions about OpenIE strength, explainability, and miscalibration are well supported.

Citations: 59

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 40%

Authors

Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, Shikun Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.

Summary TLDR

This paper evaluates ChatGPT on seven fine-grained information extraction tasks (14 datasets) across four dimensions: performance, explainability, calibration, and faithfulness. Key takeaways: ChatGPT underperforms supervised baselines in the Standard-IE (labelled) setting, but human judges rate its outputs highly in an OpenIE (no label set) setting (e.g., 84–97% reasonable on ET/NER/RC). ChatGPT gives high-quality explanations and faithful alignment to input text (>95% faithfulness), but it is poorly calibrated and often overconfident (high ECEs and confidence gaps). The authors release annotated test sets and code.

Problem Statement

Can a conversational LLM (ChatGPT) reliably perform common information extraction tasks, explain its answers, provide calibrated uncertainties, and stay faithful to source text — and how does it compare to supervised baselines?

Main Contribution

Systematic evaluation of ChatGPT on 7 IE tasks (14 datasets) along four dimensions: performance, explainability, calibration, and faithfulness.

Comparison with BERT/RoBERTa and SOTA models under a Standard-IE (label set) and an OpenIE (no labels) setup.

Key Findings

ChatGPT underperforms supervised baselines on Standard-IE tasks.

Numbers: On the full Standard-IE test sets, ChatGPT's Micro-F1 is often far below SOTA (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56).

Practical Use: Do not replace a supervised IE model with ChatGPT if you need state-of-the-art extraction accuracy on labelled tasks.

Evidence Ref: Table 2 (Standard-IE comparisons)

In the OpenIE setting, human experts rate ChatGPT's outputs as reasonable at high rates.

Numbers: OpenIE judged reasonable: ET/NER/RC ~97.2%/93.3%/84.3% (sampled tests).

Practical Use: Use ChatGPT as a zero-shot extractor or idea generator when no label set exists and human review is feasible.

Evidence Ref: Table 3 (OpenIE accuracy by human check)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Standard-IE performance (representative) | ChatGPT substantially below SOTA on many tasks (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56) | SOTA per task | Large negative gap on complex tasks | Full test sets (Table 2) | Table 2 shows ChatGPT much lower than supervised baselines on RE and EE | Table 2
OpenIE judged reasonable rate | ET/NER/RC judged reasonable: 97.2% / 93.3% / 84.3% | N/A (human judgement) | (not given) | Sampled test sets (≈200 per dataset) | Table 3 reports human expert judgement on OpenIE outputs | Table 3

What To Try In 7 Days

Run ChatGPT in OpenIE mode on unlabeled corpora to generate candidate extractions for human review.

Use ChatGPT top-5 outputs as a short-listed candidate set to accelerate annotators and reduce labeling time.

Add a lightweight calibrator or thresholding step before accepting predictions for decisions that need reliable confidences.
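The thresholding step in the last item could be sketched as follows. This is a minimal illustration, not the authors' method: it assumes you have a small labelled dev set of (self-reported confidence on the 1-100 scale, correct or not) pairs, and the names `pick_threshold`, `route`, and `dev_set` are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_threshold(dev: List[Tuple[int, bool]], target_precision: float) -> Optional[int]:
    """Return the lowest confidence cut-off whose accepted predictions reach
    the target precision on the dev set, or None if no cut-off does."""
    for threshold in sorted({conf for conf, _ in dev}):
        # Every candidate threshold comes from the data, so `accepted` is never empty.
        accepted = [ok for conf, ok in dev if conf >= threshold]
        precision = sum(accepted) / len(accepted)
        if precision >= target_precision:
            return threshold
    return None


def route(confidence: int, threshold: Optional[int]) -> str:
    """Auto-accept only at or above the cut-off; everything else goes to a human."""
    if threshold is not None and confidence >= threshold:
        return "accept"
    return "review"


# Hypothetical dev data: (self-reported confidence, was the extraction correct?)
dev_set = [(95, True), (90, False), (90, True), (85, False), (99, True), (80, False)]
threshold = pick_threshold(dev_set, target_precision=0.9)
print(threshold, route(97, threshold), route(90, threshold))
```

Because the model is reported to be overconfident, the cut-off chosen this way may sit very high or not exist at all; in that case every prediction is routed to review, which is the safe default.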

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

ChatGPT is a closed model; internal training details are unknown and may affect reproducibility.

Human evaluation was sampled (~200 per dataset) and subjective, especially for OpenIE reasonableness.
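To get a feel for how much a sample of about 200 items limits the precision of a human-judged rate, one option is a Wilson score interval. The sketch below is an illustration under assumptions: it treats the sample size as exactly 200 and the judgements as independent, neither of which the summary confirms, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Reported OpenIE "reasonable" rate for RC, with an assumed sample size of 200.
low, high = wilson_interval(0.843, 200)
print(f"{low:.3f} to {high:.3f}")
```

For the RC figure this gives an interval of roughly plus or minus five percentage points, before accounting for any disagreement between annotators.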

When Not To Use

When you require SOTA supervised extraction accuracy on labelled datasets.

When you need well-calibrated probability estimates for automated decisions.

Failure Modes

Overconfidence: high predicted confidences even on incorrect outputs (poor calibration).

Low recall in top-1 predictions on complex tasks, causing missed extractions unless top-k is used.
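Overconfidence of this kind is usually quantified with Expected Calibration Error. The sketch below uses the common equal-width binning definition; the paper's exact binning is not stated in this summary, so the bin count is an assumption, and confidences on the 1-100 scale are assumed to be rescaled to 0-1 first.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE: per-bin |accuracy - mean confidence|, weighted by bin size."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: always says 90%, right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, True]))
```

A model that reports 90% confidence but is right half the time scores about 0.4 here; a well-calibrated model scores close to 0.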

Core Entities

Models

ChatGPT, BERT, RoBERTa, SOTA (task-specific methods cited)

Metrics

Micro-F1, F1 (trigger/argument), Accuracy, Top-k recall, Expected Calibration Error (ECE), Predicted confidence (1-100)
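For readers unfamiliar with two of these metrics, here is a minimal sketch of Micro-F1 over extracted (span, label) pairs and of top-k recall over ranked label lists. It uses exact-match scoring, which is a common convention; the paper's own matching rules may differ, and the function names and sample data are hypothetical.

```python
from typing import List, Set, Tuple

Item = Tuple[str, str]  # (text span, label)


def micro_f1(gold: List[Set[Item]], pred: List[Set[Item]]) -> float:
    """Pool true positives, predictions, and gold items over all examples, then compute F1."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


def top_k_recall(gold_labels: List[str], ranked: List[List[str]], k: int) -> float:
    """Share of examples whose gold label appears among the first k ranked predictions."""
    hits = sum(g in r[:k] for g, r in zip(gold_labels, ranked))
    return hits / len(gold_labels)


# Hypothetical data for illustration only.
gold = [{("Paris", "LOC"), ("Bo Li", "PER")}, {("ACL", "ORG")}]
pred = [{("Paris", "LOC")}, {("ACL", "ORG"), ("2023", "DATE")}]
print(micro_f1(gold, pred))

ranked = [["LOC", "PER", "ORG"], ["PER", "LOC", "ORG"]]
print(top_k_recall(["PER", "ORG"], ranked, k=2))
```

Comparing `top_k_recall` at k=1 and k=5 on your own data is one way to check whether the top-5 shortlist idea recovers extractions that the top-1 prediction misses.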

Datasets

BBN, OntoNotes 5.0, CoNLL2003, TACRED, SemEval2010, ACE05-R, SciERC, ACE05-E, ACE05-E+