Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.4
Citation Count
59
Why It Matters For Business
ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.
Summary TLDR
This paper evaluates ChatGPT on seven fine-grained information extraction (IE) tasks across 14 datasets along four dimensions: performance, explainability, calibration, and faithfulness. Key takeaways: ChatGPT underperforms supervised baselines in the Standard-IE setting (label set provided), but human judges rate its outputs highly in the OpenIE setting (no label set), e.g., 84–97% reasonable on entity typing (ET), named entity recognition (NER), and relation classification (RC). Its explanations are high quality and faithful to the input text (>95% faithfulness), but it is poorly calibrated and often overconfident (high ECE values and confidence gaps). The authors release annotated test sets and code.
Problem Statement
Can a conversational LLM (ChatGPT) reliably perform common information extraction tasks, explain its answers, provide calibrated uncertainties, and stay faithful to the source text? And how does it compare to supervised baselines?
Main Contribution
Systematic evaluation of ChatGPT on 7 IE tasks (14 datasets) along four dimensions: performance, explainability, calibration, and faithfulness.
Comparison with BERT/RoBERTa and SOTA models under a Standard-IE (label set) and an OpenIE (no labels) setup.
Collected 15 keys (10 produced by the model, 5 annotated by humans) and manually annotated ≈3,000 test samples; released the datasets and code.
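To make the two setups concrete, here is a minimal sketch of how a Standard-IE prompt (label set provided) might differ from an OpenIE prompt (no label set). The function name `build_ner_prompt` and the prompt wording are hypothetical illustrations, not the prompts used in the paper.

```python
from typing import Optional, Sequence


def build_ner_prompt(sentence: str, label_set: Optional[Sequence[str]] = None) -> str:
    """Build an illustrative NER prompt.

    Hypothetical wording (an assumption, not the paper's prompt).
    With a label set the model must pick from it (Standard-IE style);
    without one it may name any type it finds fitting (OpenIE style).
    """
    task = (
        "Identify the entities in the sentence and give each one a type.\n"
        f'Sentence: "{sentence}"\n'
    )
    if label_set:
        return task + "Choose each type from this list: " + ", ".join(label_set) + "\n"
    return task + "No type list is provided; use the type name that fits best.\n"


sentence = "Marie Curie worked in Paris."
standard_prompt = build_ner_prompt(sentence, label_set=["PER", "ORG", "LOC"])
open_prompt = build_ner_prompt(sentence)
```

The only difference between the two prompts is the final instruction, which is what lets the OpenIE outputs be judged for reasonableness rather than for exact label match.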
Key Findings
ChatGPT underperforms supervised baselines on Standard-IE tasks.
In the OpenIE setting, human experts rate most of ChatGPT's outputs as reasonable (84–97% on ET, NER, and RC).
ChatGPT provides high-quality explanations that humans mostly accept.
ChatGPT is poorly calibrated and tends to be overconfident.
ChatGPT's explanations are usually faithful to the input text (over 95% in human checks).
Top-k outputs make ChatGPT a useful candidate generator.
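The candidate-generator finding rests on top-k recall: the share of samples whose gold label appears among the model's first k ranked guesses. A minimal sketch follows, assuming one gold label per sample and an already ranked candidate list; the function name and the toy labels are illustrative, not taken from the paper.

```python
from typing import Sequence


def top_k_recall(
    ranked_predictions: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of samples whose gold label is among the first k candidates."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_predictions, gold_labels)
        if gold in candidates[:k]
    )
    return hits / len(gold_labels)


# Toy data (made up for illustration): each row is a ranked candidate list.
preds = [
    ["PERSON", "ORG", "LOC"],
    ["LOC", "ORG", "PERSON"],
    ["ORG", "MISC", "LOC"],
    ["MISC", "LOC", "ORG"],
]
gold = ["PERSON", "ORG", "PERSON", "ORG"]

top1 = top_k_recall(preds, gold, k=1)  # only the first sample is a hit
top3 = top_k_recall(preds, gold, k=3)  # samples 1, 2, and 4 are hits
```

On this toy data recall rises from 0.25 at k=1 to 0.75 at k=3, which is the pattern that makes a short candidate list useful to an annotator even when the top-1 answer is often wrong.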
Results
Standard-IE performance (representative)
OpenIE judged reasonable rate
Top-5 recall (useful candidate generation)
Calibration (ECE)
Faithfulness (human check)
Who Should Care
What To Try In 7 Days
Run ChatGPT in OpenIE mode on unlabeled corpora to generate candidate extractions for human review.
Use ChatGPT top-5 outputs as a short-listed candidate set to accelerate annotators and reduce labeling time.
Add a lightweight calibrator or thresholding step before accepting predictions for decisions that need reliable confidences.
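The thresholding step above can be sketched as follows, assuming a small labelled validation set is available and that the model's self-reported confidence (1-100) is stored with each prediction. The helper `pick_threshold`, the target precision, and all numbers are hypothetical; this is one simple option, not a method from the paper.

```python
from typing import List, Optional, Sequence


def pick_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_precision: float = 0.9,
) -> Optional[float]:
    """Return the lowest confidence threshold whose accepted set
    (confidence >= threshold) reaches the target precision on
    validation data, or None if no threshold qualifies."""
    best: Optional[float] = None
    for t in sorted(set(confidences), reverse=True):
        accepted = [ok for c, ok in zip(confidences, correct) if c >= t]
        precision = sum(accepted) / len(accepted)  # accepted is never empty here
        if precision >= target_precision:
            best = t
    return best


# Made-up validation data: self-reported confidence and whether it was right.
val_conf = [95, 90, 85, 80, 70, 60]
val_ok = [True, True, True, False, True, False]

threshold = pick_threshold(val_conf, val_ok, target_precision=0.75)

# Route new predictions: auto-accept above the threshold, else human review.
new_conf = [98, 72, 65]
decisions: List[str] = [
    "accept" if threshold is not None and c >= threshold else "review"
    for c in new_conf
]
```

Because the model tends to be overconfident, the threshold should be re-estimated whenever the prompt or the data distribution changes, and a `None` result is a signal to keep everything in human review.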
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- ChatGPT is a closed model; internal training details are unknown and may affect reproducibility.
- Human evaluation was sampled (~200 per dataset) and subjective, especially for OpenIE reasonableness.
- Standard-IE prompts were concise and unified; task-specific prompt engineering could change performance.
- Calibration results rely on ChatGPT's self-reported confidence scores, which may be manipulable (for example, sensitive to how they are elicited).
When Not To Use
- When you require SOTA supervised extraction accuracy on labelled datasets.
- When you need well-calibrated probability estimates for automated decisions.
- For complex relation or event extraction where precision matters and no human review is in place.
Failure Modes
- Overconfidence: high predicted confidences even on incorrect outputs (poor calibration).
- Low recall in top-1 predictions on complex tasks, causing missed extractions unless top-k is used.
- Label confusion in Standard-IE: wrong label choices when label sets are large or labels are ambiguous.
Core Entities
Models
- ChatGPT
- BERT
- RoBERTa
- SOTA (task-specific methods cited)
Metrics
- Micro-F1
- F1 (trigger/argument)
- Accuracy
- Top-k recall
- Expected Calibration Error (ECE)
- Predicted confidence (1-100)
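As a reference for the calibration metric, here is a minimal sketch of Expected Calibration Error with equal-width bins, computed from self-reported confidences rescaled from 1-100 to [0, 1]. The bin count and the toy data are assumptions; this summary does not state the paper's exact binning.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE: weighted average over bins of |accuracy - mean confidence|.

    `confidences` must lie in [0, 1]; bins are equal-width.
    """
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up example: confident answers that are right only half the time.
scores = [95, 95, 95, 95, 60, 60]            # self-reported, 1-100 scale
is_correct = [True, True, False, False, True, False]
ece = expected_calibration_error([s / 100 for s in scores], is_correct)  # about 0.333
```

In this toy case the four predictions at 95 confidence are only 50% accurate, so that bin alone contributes a gap of 0.45; this is the overconfidence pattern the calibration finding describes.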
Datasets
- BBN
- OntoNotes 5.0
- CoNLL2003
- TACRED
- SemEval2010
- ACE05-R
- SciERC
- ACE05-E
- ACE05-E+

