Overview
The paper shows practical, low-cost ways to elicit self-explanations from ChatGPT and demonstrates key failure modes; however, the results are limited to ChatGPT, a 100-sentence SST subset, and automatic metrics.
Citations: 19
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
LLM self-explanations can replace expensive explainer runs for quick audits and UX features, but their coarseness and prompt sensitivity mean you must validate high-stakes uses with extra tests or humans.
Summary TLDR
This paper measures how well ChatGPT can generate feature-attribution explanations (word importance scores) for sentiment analysis. On a 100-sentence SST subset, ChatGPT's self-explanations score roughly the same as occlusion and LIME on common automatic faithfulness metrics, while being far cheaper to get. However, self-explanations are structurally different: they use few distinct saliency levels, produce rounded/confident prediction scores, and are often insensitive to single-word removals. These properties make common token-level evaluation metrics (e.g., decision-flip and rank-del) unreliable for such models. The authors recommend caution using token-level faithfulness metrics and call
Problem Statement
Can modern instruction-tuned LLMs (ChatGPT) generate faithful, usable feature-attribution explanations (word importance) for sentiment analysis, and how do those self-explanations compare to classical methods like occlusion and LIME?
Main Contribution
A systematic pipeline to elicit ChatGPT self-explanations in two orders: explain-then-predict (E-P) and predict-then-explain (P-E).
Empirical comparison of ChatGPT self-explanations to occlusion and LIME on five faithfulness metrics (comprehensiveness, sufficiency, DFMIT, DFFrac, RankDel) using SST sentences.
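To make the compared quantities concrete, here is a minimal sketch of occlusion saliency plus the comprehensiveness and sufficiency metrics in their usual deletion-based form. It is an illustration under assumptions, not the authors' pipeline: `toy_positive_prob` and `TOY_WEIGHTS` are hypothetical stand-ins for a real classifier, and ranking tokens by signed saliency toward the predicted class is an assumed convention.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[str]], float]  # returns P(positive)

# Hypothetical toy lexicon standing in for a real sentiment model.
TOY_WEIGHTS = {"great": 0.3, "bad": -0.2}


def toy_positive_prob(tokens: Sequence[str]) -> float:
    """Toy classifier: 0.5 plus summed lexicon weights, clipped to [0, 1]."""
    score = 0.5 + sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return min(1.0, max(0.0, score))


def class_prob(predict: Predictor, tokens: Sequence[str], positive: bool) -> float:
    """Probability assigned to the originally predicted class."""
    p = predict(tokens)
    return p if positive else 1.0 - p


def occlusion_saliency(tokens: Sequence[str], predict: Predictor) -> List[float]:
    """Saliency of token i = drop in predicted-class probability when i is removed."""
    positive = predict(tokens) >= 0.5
    full = class_prob(predict, tokens, positive)
    return [
        full - class_prob(predict, list(tokens[:i]) + list(tokens[i + 1:]), positive)
        for i in range(len(tokens))
    ]


def top_k(saliency: Sequence[float], k: int) -> set:
    """Indices of the k highest-saliency tokens (assumed ranking convention)."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def comprehensiveness(tokens, saliency, predict: Predictor, k: int) -> float:
    """Probability drop after REMOVING the top-k tokens (higher is better)."""
    positive = predict(tokens) >= 0.5
    chosen = top_k(saliency, k)
    rest = [t for i, t in enumerate(tokens) if i not in chosen]
    return class_prob(predict, tokens, positive) - class_prob(predict, rest, positive)


def sufficiency(tokens, saliency, predict: Predictor, k: int) -> float:
    """Probability drop after KEEPING only the top-k tokens (lower is better)."""
    positive = predict(tokens) >= 0.5
    chosen = top_k(saliency, k)
    kept = [t for i, t in enumerate(tokens) if i in chosen]
    return class_prob(predict, tokens, positive) - class_prob(predict, kept, positive)


tokens = ["a", "great", "film", "not", "bad"]
saliency = occlusion_saliency(tokens, toy_positive_prob)
print(saliency)
print(comprehensiveness(tokens, saliency, toy_positive_prob, k=1))
print(sufficiency(tokens, saliency, toy_positive_prob, k=1))
```

The same two metric functions accept any saliency vector, so a self-explanation (word scores returned by the model itself) can be scored by passing it in place of the occlusion output.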
Key Findings
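Self-explanations score similarly to LIME and occlusion on faithfulness metrics (e.g., E-P comprehensiveness of 0.19 vs. 0.17 for LIME and 0.15 for occlusion).
Asking the model to produce explanations reduces classification accuracy, from 92% for prediction-only to 85% (E-P) and 88% (P-E) on the 100-sentence subset.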
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence Ref |
|---|---|---|---|---|---|
| Accuracy | Prediction-only 92%; E-P 85%; P-E 88%; E-P top-k 80%; P-E top-k 83% | Prediction-only 92% | E-P -7pp; P-E -4pp | SST test subset (n=100) | Table VI |
| Comprehensiveness (E-P) | SELFEXP 0.19; LIME 0.17; Occlusion 0.15 | Occlusion 0.15 | SELFEXP +0.04 vs Occlusion | SST | Table VII |
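The deltas in the table follow directly from the reported values. The short check below recomputes them; the dictionary keys are illustrative labels, and the top-k accuracy deltas and the LIME delta are derived here by the same subtraction rather than quoted from the paper.

```python
# Accuracy values (percent) as reported in the table; n=100 sentences.
accuracy = {
    "prediction_only": 92,
    "E-P": 85,
    "P-E": 88,
    "E-P top-k": 80,
    "P-E top-k": 83,
}
baseline = accuracy["prediction_only"]

# Difference from the prediction-only baseline, in percentage points.
accuracy_delta_pp = {
    name: value - baseline
    for name, value in accuracy.items()
    if name != "prediction_only"
}

# E-P comprehensiveness as reported; occlusion is the baseline.
comprehensiveness_ep = {"SELFEXP": 0.19, "LIME": 0.17, "Occlusion": 0.15}
comp_delta = {
    name: round(value - comprehensiveness_ep["Occlusion"], 2)
    for name, value in comprehensiveness_ep.items()
    if name != "Occlusion"
}

print(accuracy_delta_pp)
print(comp_delta)
```

With n=100, each percentage point corresponds to a single sentence, which is worth keeping in mind when reading differences of a few points.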
What To Try In 7 Days
Run ChatGPT self-explanations on 100 representative examples and compare with occlusion/LIME to spot consistent differences.
If latency/cost matters, use self-explanations for dashboards and reserve LIME for spot checks.
Avoid using token-deletion faithfulness scores as sole evidence; sample human reviews for understandability.
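One way to "spot consistent differences" between a self-explanation and an occlusion or LIME attribution is to compare their rankings per sentence. The sketch below computes a tie-aware Spearman correlation and a top-k overlap in plain Python. The choice of these two agreement measures is an assumption made for illustration, not something the paper prescribes, and the example vectors are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def top_k_overlap(a: Sequence[float], b: Sequence[float], k: int) -> float:
    """Fraction of the k highest-scored token positions shared by a and b."""
    def top(s):
        return set(sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k])
    return len(top(a) & top(b)) / k


# Invented example: a coarse self-explanation vs. a finer-grained attribution.
self_exp = [0.0, 0.5, 0.0, 0.5, 1.0]
occlusion = [0.02, 0.31, 0.05, 0.18, 0.44]
print(spearman(self_exp, occlusion))
print(top_k_overlap(self_exp, occlusion, k=2))
```

Average ranks matter here because self-explanations use few distinct saliency levels, so ties are common. A constant vector yields `nan` rather than a misleading correlation.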
Risks & Boundaries
Limitations
Experiments run only on OpenAI ChatGPT and a 100-sentence subset of SST; results may not generalize to other LLMs or tasks.
Evaluations rely on automatic token-deletion metrics that the authors show are unreliable for models with rounded outputs.
When Not To Use
Do not use ChatGPT self-explanations as sole evidence in high-stakes decisions (legal, medical, compliance).
Avoid trusting single-word importance claims for long sentences or nuanced reasoning.
Failure Modes
Rounded attribution levels produce low-resolution importance maps that miss subtle influences.
Prediction outputs are often insensitive to single-word deletions, breaking deletion-based faithfulness tests.
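The second failure mode can be reproduced with a toy model. In the sketch below, a hypothetical classifier reports its confidence rounded to one decimal place. Deleting a weakly influential word changes the underlying score but not the reported one, so a deletion-based metric sees zero effect and no decision flip. All weights and the rounding step are invented for illustration and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical word weights; "great" dominates, the others matter only slightly.
WEIGHTS = {"quite": 0.01, "fine": 0.02, "great": 0.38}


def raw_prob(tokens: Sequence[str]) -> float:
    """Underlying continuous P(positive) of the toy model."""
    score = 0.5 + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return min(1.0, max(0.0, score))


def reported_prob(tokens: Sequence[str]) -> float:
    """What the model reports: the same score rounded to one decimal place."""
    return round(raw_prob(tokens), 1)


def deletion_deltas(tokens: Sequence[str], predict) -> List[float]:
    """Change in predict() when each single token is removed."""
    base = predict(tokens)
    return [
        base - predict(list(tokens[:i]) + list(tokens[i + 1:]))
        for i in range(len(tokens))
    ]


def single_deletion_flips(tokens: Sequence[str], predict) -> List[bool]:
    """True where removing one token changes the predicted label."""
    base_label = predict(tokens) >= 0.5
    return [
        (predict(list(tokens[:i]) + list(tokens[i + 1:])) >= 0.5) != base_label
        for i in range(len(tokens))
    ]


tokens = ["quite", "fine", "great"]
raw = deletion_deltas(tokens, raw_prob)
reported = deletion_deltas(tokens, reported_prob)
flips = single_deletion_flips(tokens, reported_prob)

print(raw)       # every word has some effect on the raw score
print(reported)  # the two weak words show exactly zero effect after rounding
print(flips)     # no single deletion flips the label
```

On the rounded output, the two weaker words are indistinguishable from irrelevant ones, which is why rankings built from single-token deletions lose resolution for models that emit rounded, confident scores.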

