ChatGPT can produce word-level explanations that match classic methods on faithfulness but differ sharply in form and reliability

October 17, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows practical, low-cost ways to get ChatGPT explanations and demonstrates key failure modes; however, the results are limited to ChatGPT, a small SST sample, and automatic metrics.

Citations: 19

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin

Why It Matters For Business

LLM self-explanations can replace expensive explainer runs for quick audits and UX features, but their coarseness and prompt sensitivity mean you must validate high-stakes uses with extra tests or humans.

Who Should Care

Summary TLDR

This paper measures how well ChatGPT can generate feature-attribution explanations (word importance scores) for sentiment analysis. On a 100-sentence SST subset, ChatGPT's self-explanations score roughly the same as occlusion and LIME on common automatic faithfulness metrics, while being far cheaper to obtain. However, self-explanations are structurally different: they use few distinct saliency levels, come with rounded, confident prediction scores, and are often insensitive to single-word removals. These properties make common token-level evaluation metrics (e.g., decision-flip and rank-del) unreliable for such models. The authors recommend caution when using token-level faithfulness metrics and call

Problem Statement

Can modern instruction-tuned LLMs (ChatGPT) generate faithful, usable feature-attribution explanations (word importance) for sentiment analysis, and how do those self-explanations compare to classical methods like occlusion and LIME?

Main Contribution

A systematic pipeline to elicit ChatGPT self-explanations in two orders: explain-then-predict (E-P) and predict-then-explain (P-E).

Empirical comparison of ChatGPT self-explanations to occlusion and LIME on five faithfulness metrics (comprehensiveness, sufficiency, DFMIT, DFFrac, RankDel) using SST sentences.
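The two elicitation orders named in the first contribution can be illustrated with a pair of prompt templates. This is a minimal sketch, not the paper's prompts: the wording, the score ranges, and the `build_prompt` helper are all assumptions made for illustration.

```python
# Hypothetical prompt templates; the paper's exact wording is not reproduced here.
# Assumed ranges: word importance in [-1, 1], positive-sentiment confidence in [0, 1].
EXPLAIN_THEN_PREDICT = (
    "First, give every word in the sentence an importance score between -1 and 1 "
    "for the sentiment. Then give a confidence between 0 and 1 that the "
    "sentence is positive.\n"
    "Sentence: {sentence}"
)

PREDICT_THEN_EXPLAIN = (
    "First, give a confidence between 0 and 1 that the sentence is positive. "
    "Then give every word in the sentence an importance score between -1 and 1 "
    "for the sentiment.\n"
    "Sentence: {sentence}"
)

TEMPLATES = {"E-P": EXPLAIN_THEN_PREDICT, "P-E": PREDICT_THEN_EXPLAIN}


def build_prompt(sentence: str, order: str) -> str:
    """Return the prompt for one sentence in the requested order ("E-P" or "P-E")."""
    if order not in TEMPLATES:
        raise ValueError(f"unknown order: {order!r}")
    return TEMPLATES[order].format(sentence=sentence)
```

The only difference between the two templates is which request comes first, so any gap in accuracy or faithfulness between E-P and P-E can be attributed to ordering rather than to wording.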

Key Findings

Self-explanations score similarly to LIME and occlusion on faithfulness metrics.

Numbers: E-P comprehensiveness: SELFEXP 0.19 vs LIME 0.17 (Table VII)

Practical Use: You can use ChatGPT-generated word attributions as a low-cost alternative to LIME for quick checks, but verify with other analyses before trusting fine-grained claims.

Evidence Ref: Table VII
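To make the comprehensiveness figure in this finding concrete, here is a minimal sketch of comprehensiveness and sufficiency under their common deletion-based definitions. It assumes the predicted class is positive, ranks words by raw saliency, and uses a toy lexicon scorer (`toy_positive_prob`) in place of ChatGPT; none of these names come from the paper.

```python
from typing import Callable, List

# Toy stand-in for a classifier (NOT ChatGPT): a tiny lexicon-based scorer.
LEXICON = {"great": 0.4, "good": 0.2, "boring": -0.3, "awful": -0.4}


def toy_positive_prob(words: List[str]) -> float:
    """Return a pseudo-probability of positive sentiment, clamped to [0, 1]."""
    score = 0.5 + sum(LEXICON.get(w, 0.0) for w in words)
    return min(1.0, max(0.0, score))


def top_k_indices(saliency: List[float], k: int) -> set:
    """Indices of the k words with the highest saliency."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def comprehensiveness(predict: Callable[[List[str]], float],
                      words: List[str], saliency: List[float], k: int) -> float:
    """Drop in output when the top-k words are removed (higher is better)."""
    top = top_k_indices(saliency, k)
    reduced = [w for i, w in enumerate(words) if i not in top]
    return predict(words) - predict(reduced)


def sufficiency(predict: Callable[[List[str]], float],
                words: List[str], saliency: List[float], k: int) -> float:
    """Drop in output when only the top-k words are kept (lower is better)."""
    top = top_k_indices(saliency, k)
    kept = [w for i, w in enumerate(words) if i in top]
    return predict(words) - predict(kept)


words = ["a", "great", "film"]
saliency = [0.0, 0.9, 0.1]  # hypothetical attribution for the sentence above
comp = comprehensiveness(toy_positive_prob, words, saliency, k=1)
suff = sufficiency(toy_positive_prob, words, saliency, k=1)
```

In this toy case removing "great" lowers the score substantially while keeping only "great" preserves it, which is the pattern a faithful attribution should produce.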

Asking the model to produce explanations reduces classification accuracy.

Numbers: Prediction-only 92% vs E-P 85% and P-E 88% (Table VI)

Practical Use: If prediction accuracy is critical, avoid forcing per-token attributions during inference; run explanations separately or use predict-only calls.

Evidence Ref: Table VI

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Prediction-only 92%; E-P 85%; P-E 88%; E-P top-k 80%; P-E top-k 83% | Prediction-only 92% | E-P -7pp; P-E -4pp | SST test subset (n=100) | Table VI | Table VI
comprehensiveness (E-P) | SELFEXP 0.19; LIME 0.17; Occlusion 0.15 | Occlusion 0.15 | SELFEXP +0.04 vs Occlusion | SST | Table VII | Table VII
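The deltas in the table follow directly from the reported values. A short check, using only the numbers listed above (the variable names are illustrative):

```python
# Accuracy in percent, as reported for the SST test subset (n=100).
accuracy = {"prediction-only": 92, "E-P": 85, "P-E": 88}
baseline = accuracy["prediction-only"]

# Difference from the prediction-only baseline, in percentage points.
deltas_pp = {
    name: acc - baseline
    for name, acc in accuracy.items()
    if name != "prediction-only"
}

# E-P comprehensiveness scores as reported.
comp = {"SELFEXP": 0.19, "LIME": 0.17, "Occlusion": 0.15}
comp_delta = round(comp["SELFEXP"] - comp["Occlusion"], 2)
```

With n=100, each percentage point corresponds to a single sentence, so the E-P and P-E drops amount to 7 and 4 sentences respectively.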

What To Try In 7 Days

Run ChatGPT self-explanations on 100 representative examples and compare with occlusion/LIME to spot consistent differences.

If latency/cost matters, use self-explanations for dashboards and reserve LIME for spot checks.

Avoid using token-deletion faithfulness scores as sole evidence; sample human reviews for understandability.
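For the comparison suggested in the first item, two simple diagnostics are the number of distinct saliency levels an explainer uses and how much its top-k words overlap with another explainer's. The following is a rough sketch; the function names and the example scores are hypothetical, not taken from the paper.

```python
from typing import List


def distinct_levels(saliency: List[float], ndigits: int = 2) -> int:
    """Count distinct saliency values after rounding to ndigits decimals."""
    return len({round(s, ndigits) for s in saliency})


def top_k(saliency: List[float], k: int) -> set:
    """Indices of the k highest-scoring words."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def top_k_overlap(a: List[float], b: List[float], k: int) -> float:
    """Fraction of the top-k words shared by two attributions of one sentence."""
    return len(top_k(a, k) & top_k(b, k)) / k


# Hypothetical scores for one five-word sentence.
self_exp = [0.0, 0.5, 0.0, 0.5, 0.0]      # coarse, few levels
lime = [0.02, 0.41, -0.05, 0.33, 0.01]    # fine-grained

levels_self = distinct_levels(self_exp)
levels_lime = distinct_levels(lime)
overlap = top_k_overlap(self_exp, lime, k=2)
```

A high top-k overlap combined with a much lower level count would match the pattern described in this report: agreement on which words matter, but far less resolution in how much.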

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Stanford Sentiment Treebank (SST)

Risks & Boundaries

Limitations

Experiments run only on OpenAI ChatGPT and a 100-sentence subset of SST; results may not generalize to other LLMs or tasks.

Evaluations rely on automatic token-deletion metrics that the authors show are unreliable for rounded model outputs.
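The token-deletion metrics mentioned in these limitations can be sketched as follows. This assumes the usual reading of the decision-flip metrics: DFMIT asks whether removing the single most important token flips the decision, and DFFrac is the fraction of tokens that must be removed, in saliency order, before it flips. The toy scorer and all names are illustrative stand-ins, not the paper's code.

```python
from typing import Callable, List

# Toy stand-in for a classifier (NOT ChatGPT).
LEXICON = {"great": 0.3, "good": 0.2, "fine": 0.1, "boring": -0.4}


def toy_prob(words: List[str]) -> float:
    """Pseudo-probability of positive sentiment, clamped to [0, 1]."""
    return min(1.0, max(0.0, 0.5 + sum(LEXICON.get(w, 0.0) for w in words)))


def is_positive(prob: float) -> bool:
    return prob >= 0.5


def df_mit(predict: Callable[[List[str]], float],
           words: List[str], saliency: List[float]) -> bool:
    """True if removing the most important token flips the decision."""
    top = max(range(len(words)), key=lambda i: saliency[i])
    reduced = words[:top] + words[top + 1:]
    return is_positive(predict(words)) != is_positive(predict(reduced))


def df_frac(predict: Callable[[List[str]], float],
            words: List[str], saliency: List[float]) -> float:
    """Fraction of tokens removed (most salient first) until the decision flips.

    Returns 1.0 if the decision never flips (a convention assumed here).
    """
    order = sorted(range(len(words)), key=lambda i: saliency[i], reverse=True)
    original = is_positive(predict(words))
    removed = set()
    for n, idx in enumerate(order, start=1):
        removed.add(idx)
        reduced = [w for i, w in enumerate(words) if i not in removed]
        if is_positive(predict(reduced)) != original:
            return n / len(words)
    return 1.0
```

Both metrics only register a change when the decision crosses the threshold, so a model whose output barely moves under deletions yields the same score for good and bad explanations alike.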

When Not To Use

Do not use ChatGPT self-explanations as sole evidence in high-stakes decisions (legal, medical, compliance).

Avoid trusting single-word importance claims for long sentences or nuanced reasoning.

Failure Modes

Rounded attribution levels produce low-resolution importance maps that miss subtle influences.

Prediction outputs are often insensitive to single-word deletions, breaking deletion-based faithfulness tests.
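The second failure mode can be reproduced with a toy model. The sketch below compares occlusion saliency for a smooth scorer against the same scorer with its output rounded to the nearest 0.25; the lexicon, the rounding step, and the function names are assumptions chosen only to illustrate the effect.

```python
from typing import Callable, List

# Toy stand-in for a classifier (NOT ChatGPT).
LEXICON = {"wonderful": 0.12, "great": 0.10, "charming": 0.08}


def smooth_prob(words: List[str]) -> float:
    """Fine-grained pseudo-probability of positive sentiment."""
    return min(1.0, max(0.0, 0.5 + sum(LEXICON.get(w, 0.0) for w in words)))


def rounded_prob(words: List[str]) -> float:
    """Same scorer, but reporting only coarse confidence levels (steps of 0.25)."""
    return round(smooth_prob(words) * 4) / 4


def occlusion_saliency(predict: Callable[[List[str]], float],
                       words: List[str]) -> List[float]:
    """Saliency of each word = drop in output when that word alone is removed."""
    full = predict(words)
    return [full - predict(words[:i] + words[i + 1:]) for i in range(len(words))]


sentence = ["wonderful", "great", "charming", "film"]
smooth_scores = occlusion_saliency(smooth_prob, sentence)
rounded_scores = occlusion_saliency(rounded_prob, sentence)
```

With the smooth scorer each sentiment word receives a distinct saliency, while with the rounded scorer every single-word deletion leaves the output unchanged and all saliencies are zero. A rank correlation against an all-zero vector is undefined, which is one way a RankDel-style metric breaks down for such a model.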

Core Entities

Models

ChatGPT

Metrics

comprehensiveness, sufficiency, decision flip rate (DFMIT), DFFrac, RankDel

Datasets

Stanford Sentiment Treebank (SST)

Context Entities

Models

BERT, RoBERTa, GPT-1, GPT-2

Metrics

comprehensiveness, sufficiency

Datasets

Stanford Sentiment Treebank (SST)