Overview
The study gives concrete, manually verified results across 14 datasets, so conclusions about OpenIE strength, explainability, and miscalibration are well supported.
Citations: 59
Evidence Strength: 0.70
Confidence: 0.88
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 40%
Why It Matters For Business
ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.
Summary TLDR
This paper evaluates ChatGPT on seven fine-grained information extraction tasks (14 datasets) across four dimensions: performance, explainability, calibration, and faithfulness. Key takeaways: ChatGPT underperforms supervised baselines in the Standard-IE setting (label set provided), but human judges rate its outputs highly in the OpenIE setting (no label set), e.g., 84–97% judged reasonable on ET/NER/RC. Its explanations are rated high quality and stay faithful to the input text (>95% faithfulness), but it is poorly calibrated and often overconfident (high ECE and large confidence gaps). The authors release annotated test sets and code.
Problem Statement
Can a conversational LLM (ChatGPT) reliably perform common information extraction tasks, explain its answers, provide calibrated uncertainties, and stay faithful to source text — and how does it compare to supervised baselines?
Main Contribution
Systematic evaluation of ChatGPT on 7 IE tasks (14 datasets) along four dimensions: performance, explainability, calibration, and faithfulness.
Comparison with BERT/RoBERTa and SOTA models under a Standard-IE (label set) and an OpenIE (no labels) setup.
Key Findings
ChatGPT underperforms supervised baselines on Standard-IE tasks.
In the OpenIE setting, human experts rate ChatGPT's outputs as reasonable at high rates.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Standard-IE performance (representative) | ChatGPT substantially below SOTA on many tasks (e.g., EE trigger/arg 16.6/7.8 vs SOTA ~72/56) | SOTA per task | Large negative gap on complex tasks | Full test sets (Table 2) | Table 2 shows ChatGPT much lower than supervised baselines on RE and EE | Table 2 |
| OpenIE judged reasonable rate | ET/NER/RC judged reasonable: 97.2% / 93.3% / 84.3% | N/A (human judgement) | — | Sampled test sets (≈200 per dataset) | Table 3 reports human expert judgement on OpenIE outputs | Table 3 |
What To Try In 7 Days
Run ChatGPT in OpenIE mode on unlabeled corpora to generate candidate extractions for human review.
Use ChatGPT's top-5 outputs as a shortlist of candidates to speed up annotators and reduce labeling time.
Add a lightweight calibrator or thresholding step before accepting predictions for decisions that need reliable confidences.
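The thresholding step in the last item can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes you have a small human-verified sample where each ChatGPT prediction has a self-reported confidence and a correct/incorrect label. The function name `pick_threshold` and the `target_precision` parameter are hypothetical.

```python
from typing import List, Optional


def pick_threshold(
    confidences: List[float],
    correct: List[bool],
    target_precision: float = 0.9,
) -> Optional[float]:
    """Return the lowest confidence cutoff whose accepted set meets target_precision.

    Hypothetical helper: 'confidences' are model-reported scores on a
    human-verified sample, 'correct' marks which predictions were right.
    Returns None if no cutoff reaches the target.
    """
    # Walk predictions from most to least confident, tracking running precision.
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best: Optional[float] = None
    hits = 0
    for i, (conf, ok) in enumerate(pairs, start=1):
        hits += int(ok)
        # Evaluate only at the end of a run of tied confidences, because a
        # cutoff of 'conf' accepts every prediction with that same score.
        is_last_of_tie = i == len(pairs) or pairs[i][0] != conf
        if is_last_of_tie and hits / i >= target_precision:
            best = conf  # a lower cutoff accepts more predictions
    return best


# Illustrative, made-up sample: five verified predictions.
sample_conf = [0.99, 0.95, 0.90, 0.90, 0.60]
sample_ok = [True, True, True, False, False]
cutoff = pick_threshold(sample_conf, sample_ok, target_precision=0.9)
```

Predictions below the returned cutoff would go to human review rather than being accepted automatically. Because the model is reported to be overconfident, the cutoff should be re-estimated whenever the task or prompt changes.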
Risks & Boundaries
Limitations
ChatGPT is a closed model; internal training details are unknown and may affect reproducibility.
Human evaluation was sampled (~200 per dataset) and subjective, especially for OpenIE reasonableness.
When Not To Use
When you require SOTA supervised extraction accuracy on labelled datasets.
When you need well-calibrated probability estimates for automated decisions.
Failure Modes
Overconfidence: high predicted confidences even on incorrect outputs (poor calibration).
Low recall in top-1 predictions on complex tasks, causing missed extractions unless top-k is used.
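To check the overconfidence failure mode on your own data, you can compute the expected calibration error (ECE). The sketch below uses the common equal-width binned definition; the paper's exact binning is not given in this summary, so the bin count and the function name are assumptions.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float],
    correct: List[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    Assumes confidences lie in [0, 1]. Bins are equal-width and half-open
    [a, b), with a confidence of exactly 1.0 placed in the last bin.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_counts = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_counts[idx] += int(ok)

    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_counts):
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = hits / count
        mean_conf = conf_sum / count
        ece += (count / n) * abs(accuracy - mean_conf)
    return ece


# Illustrative, made-up sample: the model says 0.9 but is right half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
```

An ECE near 0 means stated confidences match observed accuracy. In the made-up sample, the gap between 0.9 stated confidence and 0.5 accuracy gives an ECE of about 0.4.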

