ChatGPT is weak at standard supervised IE but surprisingly strong at open extraction, explains itself well, yet is overconfident

April 23, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.4

Citation Count

59

Authors

Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, Shikun Zhang

Links

Abstract / PDF

Why It Matters For Business

ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.

Summary TLDR

This paper evaluates ChatGPT on seven fine-grained information extraction (IE) tasks (14 datasets) across four dimensions: performance, explainability, calibration, and faithfulness. Key takeaways: ChatGPT underperforms supervised baselines in the Standard-IE setting (label set provided), but human judges rate its outputs highly in the OpenIE setting (no label set), e.g., 84–97% judged reasonable on entity typing (ET), named entity recognition (NER), and relation classification (RC). ChatGPT gives high-quality explanations that stay faithful to the input text (>95% faithfulness), but it is poorly calibrated and often overconfident (high Expected Calibration Error and large confidence gaps). The authors release annotated test sets and code.

Problem Statement

Can a conversational LLM (ChatGPT) reliably perform common information extraction tasks, explain its answers, provide calibrated uncertainties, and stay faithful to source text — and how does it compare to supervised baselines?

Main Contribution

Systematic evaluation of ChatGPT on 7 IE tasks (14 datasets) along four dimensions: performance, explainability, calibration, and faithfulness.

Comparison with BERT/RoBERTa and SOTA models under a Standard-IE (label set) and an OpenIE (no labels) setup.

Collected 10 model-produced keys and 5 human-annotated keys (15 total) and manually annotated ≈3,000 test samples; released datasets and code.

Key Findings

ChatGPT underperforms supervised baselines on Standard-IE tasks.

Numbers: Standard-IE full-test Micro-F1: ChatGPT is often far below SOTA (e.g., event extraction (EE) trigger/argument 16.6/7.8 vs. SOTA ~72/56).

In the OpenIE setting, human experts rate ChatGPT's outputs as reasonable at high rates.

Numbers: OpenIE judged reasonable: ET/NER/RC ~97.2%/93.3%/84.3% (sampled tests).

ChatGPT provides high-quality explanations that humans mostly accept.

Numbers: Reasonableness (human check) often >90% across tasks in both Standard-IE and OpenIE.

ChatGPT is poorly calibrated and tends to be overconfident.

Numbers: Expected Calibration Error (ECE) is much higher for ChatGPT (e.g., SemEval RC 0.46; ACE05-R 0.745) than for BERT/RoBERTa.

ChatGPT's explanations are usually faithful to the input text.

Numbers: Faithfulness >95% on nearly all datasets and settings (many entries ~98–100%).

Top-k outputs make ChatGPT a useful candidate generator.

Numbers: Top-5 recall example: BBN 94.9%, SemEval2010 76.0% (top-1 much lower).

Results

Standard-IE performance (representative)

Value: ChatGPT substantially below SOTA on many tasks (e.g., EE trigger/argument 16.6/7.8 vs. SOTA ~72/56)

Baseline: SOTA per task
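
For readers unfamiliar with the metric, the following is a minimal sketch of Micro-F1 over extracted items. It assumes each example's predictions and gold annotations can be represented as collections of hashable tuples such as (span, label); that representation and the function name `micro_f1` are illustrative choices, not the paper's evaluation script.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1: pool true/false positives and false negatives
    over all examples, then compute precision, recall, and F1 once."""
    tp = fp = fn = 0
    for pred_items, gold_items in zip(predicted, gold):
        pred_set, gold_set = set(pred_items), set(gold_items)
        tp += len(pred_set & gold_set)   # extracted and correct
        fp += len(pred_set - gold_set)   # extracted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two sentences with (span, label) extractions.
predicted = [[("Paris", "LOC")], [("Acme", "ORG"), ("Bob", "ORG")]]
gold = [[("Paris", "LOC")], [("Acme", "ORG"), ("Bob", "PER")]]
score = micro_f1(predicted, gold)
```

Because counts are pooled before averaging, frequent labels dominate the score, which is one reason the metric is reported per dataset.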

OpenIE judged reasonable rate

Value: ET/NER/RC judged reasonable: 97.2% / 93.3% / 84.3%

Baseline: N/A (human judgement)

Top-5 recall (useful candidate generation)

Value: BBN top-5 recall 94.9%; SemEval2010 top-5 recall 76.0%

Baseline: top-1 recall lower (e.g., SemEval top-1 42.5%)
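
The top-k figures above can be read as "how often the gold label appears among the first k candidates". A minimal sketch of that computation follows, assuming one gold label per example and a ranked candidate list per example; the paper's exact scoring may differ (for instance for multi-label tasks), and `top_k_recall` is a name chosen here for illustration.

```python
def top_k_recall(ranked_predictions, gold_labels, k=5):
    """Fraction of examples whose gold label is among the first k candidates."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need one candidate list per gold label")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_predictions, gold_labels)
        if gold in candidates[:k]
    )
    return hits / len(gold_labels)


# Hypothetical ranked outputs for three examples.
ranked = [["PERSON", "ORG"], ["LOC", "GPE"], ["ORG"]]
gold = ["ORG", "GPE", "PERSON"]
recall_at_1 = top_k_recall(ranked, gold, k=1)
recall_at_2 = top_k_recall(ranked, gold, k=2)
```

Comparing the value at k=1 with the value at k=5 on your own data gives a quick read on whether the model is more useful as a shortlist generator than as a single-answer extractor.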

Calibration (ECE)

Value: ChatGPT ECEs high (examples: SemEval 0.46; ACE05-R 0.745; many >0.6)

Baseline: BERT/RoBERTa ECE typically <0.3
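
As a reference point for these numbers, here is a minimal sketch of the standard equal-width-bin form of Expected Calibration Error. It assumes confidences have already been mapped to [0, 1] (the summary lists self-reported confidence on a 1-100 scale, so dividing by 100 is one plausible mapping) and uses 10 bins; the bin count and the function name are assumptions, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of
    (bin size / N) * |bin accuracy - bin mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical self-reported scores on a 1-100 scale, rescaled to [0, 1].
raw_scores = [90, 95, 80, 85]
was_correct = [True, False, False, True]
ece = expected_calibration_error([s / 100 for s in raw_scores], was_correct)
```

An overconfident model shows up as bins where mean confidence sits well above accuracy, which is the pattern the ChatGPT numbers above suggest.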

Faithfulness (human check)

Value: Faithful explanations >95% in most datasets

Baseline: N/A

What To Try In 7 Days

Run ChatGPT in OpenIE mode on unlabeled corpora to generate candidate extractions for human review.

Use ChatGPT top-5 outputs as a short-listed candidate set to accelerate annotators and reduce labeling time.

Add a lightweight calibrator or thresholding step before accepting predictions for decisions that need reliable confidences.
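
One way to approach the calibrator and thresholding step is histogram binning on a small labeled development set: replace each self-reported confidence with the accuracy actually observed for that confidence range, then auto-accept only above a threshold. This is a sketch under stated assumptions, not the paper's method; the function names, the 10-bin layout, and the 0.8 threshold are all hypothetical.

```python
def fit_bin_calibrator(confidences, correct, n_bins=10):
    """Return a per-bin accuracy table learned from a labeled dev set.
    Bins with no dev examples map to 0.0 so unseen ranges go to review."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        totals[idx] += 1
        hits[idx] += int(ok)
    return [hits[i] / totals[i] if totals[i] else 0.0 for i in range(n_bins)]


def calibrate(conf, table):
    """Map a raw confidence in [0, 1] to the accuracy observed for its bin."""
    n_bins = len(table)
    return table[min(int(conf * n_bins), n_bins - 1)]


def route(conf, table, threshold=0.8):
    """Accept automatically only if the calibrated confidence clears the threshold."""
    return "accept" if calibrate(conf, table) >= threshold else "review"


# Hypothetical dev set: the model says 0.95 but is right only half the time there.
dev_conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
dev_correct = [True, False, False, True, True, True]
table = fit_bin_calibrator(dev_conf, dev_correct)
decision = route(0.97, table)
```

Sending empty bins to review is a deliberately conservative choice here; with a larger development set, a smoother method such as isotonic regression or temperature-style rescaling may be a better fit.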

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • ChatGPT is a closed model; internal training details are unknown and may affect reproducibility.
  • Human evaluation was sampled (~200 per dataset) and subjective, especially for OpenIE reasonableness.
  • Standard-IE prompts were concise and unified; task-specific prompt engineering could change performance.
  • Calibration and confidence depend on ChatGPT's self-reported scores, which may be manipulable.

When Not To Use

  • When you require SOTA supervised extraction accuracy on labelled datasets.
  • When you need well-calibrated probability estimates for automated decisions.
  • For complex relation and event extraction without human review, where precision matters.

Failure Modes

  • Overconfidence: high predicted confidences even on incorrect outputs (poor calibration).
  • Low recall in top-1 predictions on complex tasks, causing missed extractions unless top-k is used.
  • Standard-IE label selection confusion when label sets are large or labels are ambiguous.

Core Entities

Models

  • ChatGPT
  • BERT
  • RoBERTa
  • SOTA (task-specific methods cited)

Metrics

  • Micro-F1
  • F1 (trigger/argument)
  • Accuracy
  • Top-k recall
  • Expected Calibration Error (ECE)
  • Predicted confidence (1-100)

Datasets

  • BBN
  • OntoNotes 5.0
  • CoNLL2003
  • TACRED
  • SemEval2010
  • ACE05-R
  • SciERC
  • ACE05-E
  • ACE05-E+