ChatGPT can produce word-level explanations that match classic methods on faithfulness but differ sharply in form and reliability

October 17, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

19

Authors

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin

Why It Matters For Business

LLM self-explanations can replace expensive explainer runs for quick audits and UX features, but they are coarse and sensitive to prompt wording, so validate high-stakes uses with additional tests or human review.

Summary TLDR

This paper measures how well ChatGPT can generate feature-attribution explanations (word importance scores) for sentiment analysis. On a 100-sentence SST subset, ChatGPT's self-explanations score roughly the same as occlusion and LIME on common automatic faithfulness metrics, while being far cheaper to obtain. However, self-explanations are structurally different: they use few distinct saliency levels, come with rounded, confident prediction scores, and the underlying predictions are often insensitive to single-word removals. These properties make common token-level evaluation metrics (e.g., decision flip and RankDel) unreliable for such models. The authors recommend caution when using token-level faithfulness metrics and call 

Problem Statement

Can modern instruction-tuned LLMs (ChatGPT) generate faithful, usable feature-attribution explanations (word importance) for sentiment analysis, and how do those self-explanations compare to classical methods like occlusion and LIME?

Main Contribution

A systematic pipeline to elicit ChatGPT self-explanations in two orders: explain-then-predict (E-P) and predict-then-explain (P-E).

Empirical comparison of ChatGPT self-explanations to occlusion and LIME on five faithfulness metrics (comprehensiveness, sufficiency, DFMIT, DFFrac, RankDel) using SST sentences.

Qualitative analyses exposing rounded saliency levels, prediction roundedness, insensitivity to single-word removals, and prompt-dependence; practical warnings for common token-level evaluation metrics.
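The two elicitation orders in the first contribution differ only in which task the model is asked to do first. The sketch below shows one way to build both prompts from the same two instructions. The function name `build_prompt`, the instruction wording, and the score ranges are hypothetical choices made for illustration; the paper's actual prompts are not reproduced in this summary.

```python
def build_prompt(sentence: str, order: str) -> str:
    """Build an illustrative prompt for one elicitation order.

    order is "E-P" (explain, then predict) or "P-E" (predict, then explain).
    The instruction wording and score ranges below are assumptions.
    """
    explain = ("List every word of the sentence with an importance score "
               "between -1 (negative) and 1 (positive).")
    predict = ("Classify the sentence as positive (1) or negative (0) and "
               "give a confidence between 0 and 1.")
    if order == "E-P":
        steps = [explain, predict]
    elif order == "P-E":
        steps = [predict, explain]
    else:
        raise ValueError(f"unknown order: {order!r}")
    # Number the steps so the requested order is explicit in the prompt.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Sentence: {sentence}\n{numbered}"


ep_prompt = build_prompt("a fine film", "E-P")
pe_prompt = build_prompt("a fine film", "P-E")
```

Keeping the two instructions identical and changing only their order makes it easier to attribute any difference in accuracy or explanations to the ordering rather than to wording.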

Key Findings

Self-explanations score similarly to LIME and occlusion on faithfulness metrics.

Numbers: E-P comprehensiveness: SELFEXP 0.19 vs LIME 0.17 (Table VII)

Asking the model to produce explanations reduces classification accuracy.

Numbers: Prediction-only 92% vs E-P 85% and P-E 88% (Table VI)

Model predictions and attributions are coarse and often insensitive to single-word deletions.

Numbers: 82.6% of words have zero occlusion saliency for E-P; many predictions stay unchanged after several deletions (Tables XI–

Different explanation methods disagree substantially despite similar faithfulness scores.

Top-k style explanations (highlighting a few words) are not uniformly better than full attributions.

Numbers: Top-k DFMIT@k: TOPK 0.34 vs SELFEXP 0.32 and LIME 0.40 in some splits (Table VIII)

Results

Accuracy

Value: Prediction-only 92%; E-P 85%; P-E 88%; E-P top-k 80%; P-E top-k 83%

Baseline: Prediction-only 92%

Comprehensiveness (E-P)

Value: SELFEXP 0.19; LIME 0.17; Occlusion 0.15

Baseline: Occlusion 0.15

DFMIT (P-E)

Value: LIME 0.10; Occlusion 0.14; SELFEXP 0.07

Baseline: Occlusion 0.14

Occlusion saliency zeros

Value: 82.6% of words have zero occlusion saliency (E-P)
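The high share of zero occlusion saliencies follows naturally when a model reports rounded confidences. The sketch below is a toy illustration, not the paper's setup: `rounded_confidence` is a hypothetical stand-in for ChatGPT with a made-up word lexicon, and it rounds its positive-class confidence to one decimal. Occlusion saliency is computed here as the confidence change when one word is deleted.

```python
def rounded_confidence(words):
    """Toy classifier: positive-class confidence, rounded to one decimal.

    The lexicon weights are made up for illustration.
    """
    lexicon = {"great": 0.30, "good": 0.12, "bad": -0.12,
               "awful": -0.30, "slow": -0.04, "fine": 0.03}
    raw = 0.5 + sum(lexicon.get(w.lower(), 0.0) for w in words)
    clipped = min(1.0, max(0.0, raw))
    return round(clipped, 1)  # the rounding step is what hides small effects


def occlusion_saliency(words, predict):
    """Saliency of word i = confidence with all words minus confidence without word i."""
    base = predict(words)
    return [round(base - predict(words[:i] + words[i + 1:]), 10)
            for i in range(len(words))]


sentence = "the plot is slow but the acting is great".split()
saliency = occlusion_saliency(sentence, rounded_confidence)
zero_share = sum(s == 0 for s in saliency) / len(saliency)
```

In this toy case only "great" receives a nonzero saliency. The word "slow" does shift the unrounded score, but the shift disappears after rounding, so deletion-based metrics treat it as irrelevant.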

Who Should Care

What To Try In 7 Days

Run ChatGPT self-explanations on 100 representative examples and compare with occlusion/LIME to spot consistent differences.

If latency/cost matters, use self-explanations for dashboards and reserve LIME for spot checks.

Do not rely on token-deletion faithfulness scores as the sole evidence; also have humans review a sample of explanations for understandability.
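For the side-by-side comparison suggested above, a simple starting point is the overlap between the top-k words of two explanation methods. The sketch below uses Jaccard overlap of top-k positions; this is one possible agreement measure chosen for illustration, not necessarily the one used in the paper, and the attribution values are made up.

```python
def top_k_positions(scores, k):
    """Positions of the k largest absolute scores (ties keep sentence order)."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(order[:k])


def top_k_agreement(scores_a, scores_b, k):
    """Jaccard overlap of the two top-k position sets, between 0 and 1 (k >= 1)."""
    a = top_k_positions(scores_a, k)
    b = top_k_positions(scores_b, k)
    return len(a & b) / len(a | b)


# Made-up attributions for "not a great movie"; not taken from the paper.
self_explanation = [-0.5, 0.0, 0.5, 0.0]
lime_like = [-0.41, 0.02, 0.35, 0.08]

agreement_top2 = top_k_agreement(self_explanation, lime_like, k=2)
agreement_top3 = top_k_agreement(self_explanation, lime_like, k=3)
```

Here the two methods agree on the two strongest words, but agreement drops at k=3 because the coarse self-explanation ties the remaining words at zero and the tie is broken by position. Averaging this score over the 100 examples gives a quick view of where the methods consistently diverge.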

Reproducibility

Data Urls

  • Stanford Sentiment Treebank (SST)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments run only on OpenAI ChatGPT and a 100-sentence subset of SST; results may not generalize to other LLMs or tasks.
  • Evaluations rely on automatic token-deletion metrics that the authors show are unreliable for rounded model outputs.
  • No human-subject or retraining-based evaluations were performed.

When Not To Use

  • Do not use ChatGPT self-explanations as sole evidence in high-stakes decisions (legal, medical, compliance).
  • Avoid trusting single-word importance claims for long sentences or nuanced reasoning.
  • Do not replace rigorous feature-attribution audits or retesting when model fine-tuning is possible.

Failure Modes

  • Rounded attribution levels produce low-resolution importance maps that miss subtle influences.
  • Prediction outputs are often insensitive to single-word deletions, breaking deletion-based faithfulness tests.
  • Explanations are prompt-dependent; different prompt orders (E-P vs P-E) yield different explanations and accuracies.
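The first failure mode can be checked cheaply before any faithfulness metric is run. The sketch below counts how many distinct saliency levels an explanation uses and what share of words are tied with at least one other word. The function name `saliency_resolution` and the example values are hypothetical and only illustrate the idea.

```python
from collections import Counter


def saliency_resolution(scores, decimals=2):
    """Summarize how coarse an attribution vector is.

    Returns the number of distinct levels (after rounding to `decimals`)
    and the share of words whose score is shared with another word.
    """
    rounded = [round(s, decimals) for s in scores]
    counts = Counter(rounded)
    tied = sum(c for c in counts.values() if c > 1)
    return {"distinct_levels": len(counts), "tied_share": tied / len(rounded)}


# Made-up examples: a coarse self-explanation and a finer-grained attribution.
coarse = saliency_resolution([0.0, 0.0, 0.5, 0.0, -0.5, 0.0])
fine = saliency_resolution([0.03, -0.01, 0.42, 0.07, -0.38, 0.02])
```

A high tied share means any ranking of words is partly arbitrary, which is a reason to distrust rank-based metrics for that explanation.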

Core Entities

Models

  • ChatGPT

Metrics

  • comprehensiveness
  • sufficiency
  • decision flip rate (DFMIT)
  • DFFrac
  • RankDel
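To make the metrics above concrete, the sketch below computes four of them for a single sentence, using definitions that are common in the literature: comprehensiveness as the confidence drop after deleting the top-k words, sufficiency as the confidence drop when keeping only the top-k words, DFMIT as whether deleting the single most important word flips the decision, and DFFrac as the fraction of words that must be deleted in rank order to flip it. The paper's exact definitions (for example, how k is chosen or averaged) are not given in this summary, so treat these as assumptions. `predict_proba` is a toy stand-in for a classifier with made-up weights, and RankDel is omitted.

```python
def predict_proba(words):
    """Toy stand-in for a classifier: probability of the positive class."""
    lexicon = {"great": 0.35, "good": 0.2, "dull": -0.2, "awful": -0.35}  # made up
    raw = 0.5 + sum(lexicon.get(w, 0.0) for w in words)
    return min(1.0, max(0.0, raw))


def label(words):
    return 1 if predict_proba(words) >= 0.5 else 0


def class_prob(words, cls):
    p = predict_proba(words)
    return p if cls == 1 else 1.0 - p


def ranked(scores):
    """Word positions from most to least important (by absolute score)."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)


def drop(words, positions):
    chosen = set(positions)
    return [w for i, w in enumerate(words) if i not in chosen]


def keep(words, positions):
    chosen = set(positions)
    return [w for i, w in enumerate(words) if i in chosen]


def comprehensiveness(words, scores, k):
    """Confidence drop in the predicted class after deleting the top-k words."""
    cls = label(words)
    return class_prob(words, cls) - class_prob(drop(words, ranked(scores)[:k]), cls)


def sufficiency(words, scores, k):
    """Confidence drop in the predicted class when only the top-k words remain."""
    cls = label(words)
    return class_prob(words, cls) - class_prob(keep(words, ranked(scores)[:k]), cls)


def dfmit(words, scores):
    """1 if deleting the single most important word flips the decision, else 0."""
    return int(label(drop(words, ranked(scores)[:1])) != label(words))


def dffrac(words, scores):
    """Fraction of words deleted (in rank order) before the decision flips.

    Returns 1.0 if the decision never flips; that fallback is an assumption.
    """
    order = ranked(scores)
    original = label(words)
    for n in range(1, len(order) + 1):
        if label(drop(words, order[:n])) != original:
            return n / len(words)
    return 1.0


words = "a great cast in a dull story".split()
scores = [0.0, 0.35, 0.0, 0.0, 0.0, -0.2, 0.0]  # made-up attribution
```

Under these definitions, higher comprehensiveness and lower sufficiency indicate a more faithful explanation. All four quantities depend on the model's output changing when words are deleted, which is why rounded or insensitive predictions undermine them.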

Datasets

  • Stanford Sentiment Treebank (SST)

Context Entities

Models

  • BERT
  • RoBERTa
  • GPT-1
  • GPT-2

Metrics

  • comprehensiveness
  • sufficiency

Datasets

  • Stanford Sentiment Treebank (SST)