Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
19
Why It Matters For Business
LLM self-explanations can replace expensive explainer runs for quick audits and UX features, but they are coarse and sensitive to prompt wording, so high-stakes uses need validation through additional tests or human review.
Summary TLDR
This paper measures how well ChatGPT can generate feature-attribution explanations (word importance scores) for sentiment analysis. On a 100-sentence SST subset, ChatGPT's self-explanations score roughly the same as occlusion and LIME on common automatic faithfulness metrics, while being far cheaper to obtain. However, self-explanations are structurally different: they use few distinct saliency levels, come with rounded, confident prediction scores, and the underlying predictions are often insensitive to single-word removals. These properties make common token-level evaluation metrics (e.g., decision-flip and rank-del) unreliable for such models. The authors recommend caution when using token-level faithfulness metrics and call
Problem Statement
Can modern instruction-tuned LLMs (ChatGPT) generate faithful, usable feature-attribution explanations (word importance) for sentiment analysis, and how do those self-explanations compare to classical methods like occlusion and LIME?
Main Contribution
A systematic pipeline to elicit ChatGPT self-explanations in two orders: explain-then-predict (E-P) and predict-then-explain (P-E).
Empirical comparison of ChatGPT self-explanations to occlusion and LIME on five faithfulness metrics (comprehensiveness, sufficiency, DFMIT, DFFrac, RankDel) using SST sentences.
Qualitative analyses exposing rounded saliency levels, prediction roundedness, insensitivity to single-word removals, and prompt-dependence; practical warnings for common token-level evaluation metrics.
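The two elicitation orders can be sketched as a small prompt builder. This is a minimal illustration only: the instruction wording, the score ranges, and the function name `build_prompt` are assumptions, not the authors' actual prompts.

```python
# Hypothetical instruction texts; the paper's real prompt wording is not reproduced here.
EXPLAIN_STEP = "List every word of the sentence with an importance score between -1 and 1."
PREDICT_STEP = "Give the probability, between 0 and 1, that the sentiment is positive."


def build_prompt(sentence, order):
    """Build a prompt in explain-then-predict (E-P) or predict-then-explain (P-E) order."""
    if order == "E-P":
        steps = [EXPLAIN_STEP, PREDICT_STEP]
    elif order == "P-E":
        steps = [PREDICT_STEP, EXPLAIN_STEP]
    else:
        raise ValueError("order must be 'E-P' or 'P-E'")
    # Number the steps so the model is asked to answer them in the given order.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Sentence: {sentence}\n{numbered}"
```

Only the order of the two steps differs between the variants, so any change in accuracy or explanations can be attributed to ordering rather than wording.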
Key Findings
Self-explanations score similarly to LIME and occlusion on faithfulness metrics.
Asking the model to produce explanations reduces classification accuracy.
Model predictions and attributions are coarse and often insensitive to single-word deletions.
Different explanation methods disagree substantially despite similar faithfulness scores.
Top-k style explanations (highlighting a few words) are not uniformly better than full attributions.
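The deletion-based faithfulness metrics behind these findings can be sketched as follows. This is a simplified sketch under stated assumptions: `LEXICON` and `predict_proba` are a hypothetical toy classifier standing in for ChatGPT, tokens are ranked by absolute saliency, and the definitions follow common usage in the literature, which may differ in detail from the paper's. RankDel is omitted to keep the sketch short.

```python
import math

# Hypothetical toy lexicon; a stand-in for a real sentiment model.
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}


def predict_proba(tokens):
    """Toy positive-class probability: sigmoid of summed lexicon weights."""
    score = sum(LEXICON.get(t.lower(), 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def predicted_label(tokens):
    return 1 if predict_proba(tokens) >= 0.5 else 0


def class_prob(tokens, label):
    p = predict_proba(tokens)
    return p if label == 1 else 1.0 - p


def ranked(saliency):
    """Token indices sorted from most to least important (by absolute saliency)."""
    return sorted(range(len(saliency)), key=lambda i: -abs(saliency[i]))


def without(tokens, idxs):
    drop = set(idxs)
    return [t for i, t in enumerate(tokens) if i not in drop]


def comprehensiveness(tokens, saliency, k):
    """Drop in predicted-class probability after removing the top-k tokens."""
    label = predicted_label(tokens)
    reduced = without(tokens, ranked(saliency)[:k])
    return class_prob(tokens, label) - class_prob(reduced, label)


def sufficiency(tokens, saliency, k):
    """Drop in predicted-class probability when only the top-k tokens are kept."""
    label = predicted_label(tokens)
    keep = set(ranked(saliency)[:k])
    kept = [t for i, t in enumerate(tokens) if i in keep]
    return class_prob(tokens, label) - class_prob(kept, label)


def df_mit(tokens, saliency):
    """True if removing the single most important token flips the decision."""
    reduced = without(tokens, ranked(saliency)[:1])
    return predicted_label(reduced) != predicted_label(tokens)


def df_frac(tokens, saliency):
    """Smallest fraction of tokens to remove before the decision flips (1.0 if never)."""
    label = predicted_label(tokens)
    order = ranked(saliency)
    for n in range(1, len(tokens) + 1):
        if predicted_label(without(tokens, order[:n])) != label:
            return n / len(tokens)
    return 1.0
```

Every metric here depends on the prediction changing when tokens are deleted, which is why they lose resolution when model outputs are rounded or unaffected by single-word removals.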
Results
Accuracy
comprehensiveness (E-P)
DFMIT (P-E)
occlusion saliency zeros
Who Should Care
What To Try In 7 Days
Run ChatGPT self-explanations on 100 representative examples and compare with occlusion/LIME to spot consistent differences.
If latency/cost matters, use self-explanations for dashboards and reserve LIME for spot checks.
Avoid using token-deletion faithfulness scores as the sole evidence; add a sample of human reviews to check understandability.
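A comparison along these lines might start from the sketch below, which computes occlusion saliency and the top-k overlap with a self-explanation. The toy `predict_proba`, the `LEXICON`, and the example self-explanation scores are all hypothetical placeholders; in practice the predictor would be the model under audit and the scores would come from its response.

```python
import math

# Hypothetical toy lexicon; replace predict_proba with calls to the real model.
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}


def predict_proba(tokens):
    """Toy positive-class probability: sigmoid of summed lexicon weights."""
    score = sum(LEXICON.get(t.lower(), 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_saliency(tokens):
    """Saliency of each token = change in prediction when that token is removed."""
    base = predict_proba(tokens)
    return [base - predict_proba(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def top_k_overlap(saliency_a, saliency_b, k):
    """Fraction of the k most important token positions shared by two attributions."""
    def top(s):
        return set(sorted(range(len(s)), key=lambda i: -abs(s[i]))[:k])
    return len(top(saliency_a) & top(saliency_b)) / k


tokens = ["a", "great", "but", "bad", "film"]
occlusion = occlusion_saliency(tokens)
self_explanation = [0.0, 1.0, 0.0, -0.5, 0.0]  # made-up scores for illustration
overlap = top_k_overlap(occlusion, self_explanation, k=2)
```

Occlusion needs one extra model call per token, while a self-explanation arrives with the prediction in a single call, which is where the cost difference comes from.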
Reproducibility
Data Urls
- Stanford Sentiment Treebank (SST)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments run only on OpenAI ChatGPT and a 100-sentence subset of SST; results may not generalize to other LLMs or tasks.
- Evaluations rely on automatic token-deletion metrics that the authors show are unreliable for rounded model outputs.
- No human-subject or retraining-based evaluations were performed.
When Not To Use
- Do not use ChatGPT self-explanations as sole evidence in high-stakes decisions (legal, medical, compliance).
- Avoid trusting single-word importance claims for long sentences or nuanced reasoning.
- Do not use self-explanations in place of rigorous feature-attribution audits or retesting when model fine-tuning is possible.
Failure Modes
- Rounded attribution levels produce low-resolution importance maps that miss subtle influences.
- Prediction outputs are often insensitive to single-word deletions, breaking deletion-based faithfulness tests.
- Explanations are prompt-dependent; different prompt orders (E-P vs P-E) yield different explanations and accuracies.
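The first two failure modes can be checked with two simple diagnostics, sketched below. The function names and the coarse toy predictor `rounded_predict` are hypothetical; the predictor only mimics a model that returns a handful of rounded confidence values.

```python
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "awful"}


def rounded_predict(tokens):
    """Hypothetical coarse predictor: returns only 0.1, 0.5, or 0.9."""
    score = sum((t.lower() in POSITIVE) - (t.lower() in NEGATIVE) for t in tokens)
    if score > 0:
        return 0.9
    if score < 0:
        return 0.1
    return 0.5


def distinct_levels(saliency, ndigits=2):
    """Number of distinct saliency values after rounding."""
    return len({round(s, ndigits) for s in saliency})


def insensitive_fraction(tokens, predict, ndigits=2):
    """Fraction of single-word deletions that leave the (rounded) prediction unchanged."""
    if not tokens:
        return 0.0
    base = round(predict(tokens), ndigits)
    unchanged = sum(
        1
        for i in range(len(tokens))
        if round(predict(tokens[:i] + tokens[i + 1:]), ndigits) == base
    )
    return unchanged / len(tokens)
```

A high insensitive fraction means single-word deletion tests have little signal to work with, and a low count of distinct levels means the attribution cannot rank most tokens against each other.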
Core Entities
Models
- ChatGPT
Metrics
- comprehensiveness
- sufficiency
- decision flip, most informative token (DFMIT)
- decision flip, fraction of tokens (DFFrac)
- RankDel
Datasets
- Stanford Sentiment Treebank (SST)
Context Entities
Models
- BERT
- RoBERTa
- GPT-1
- GPT-2
Metrics
- comprehensiveness
- sufficiency
Datasets
- Stanford Sentiment Treebank (SST)

