ChatGPT can produce word-level explanations that match classic methods on faithfulness but differ sharply in form and reliability

Overview

Decision Snapshot: Ready For Pilot

The paper shows practical, low-cost ways to obtain ChatGPT explanations and demonstrates key failure modes; however, the results are limited to ChatGPT on a small SST sample and to automatic metrics.

Citations: 19

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, Leilani H. Gilpin

Links

Abstract / PDF / Data

Why It Matters For Business

LLM self-explanations can replace expensive explainer runs for quick audits and UX features, but their coarseness and prompt sensitivity mean you must validate high-stakes uses with extra tests or humans.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

This paper measures how well ChatGPT can generate feature-attribution explanations (word importance scores) for sentiment analysis. On a 100-sentence SST subset, ChatGPT's self-explanations score roughly the same as occlusion and LIME on common automatic faithfulness metrics, while being far cheaper to get. However, self-explanations are structurally different: they use few distinct saliency levels, produce rounded/confident prediction scores, and are often insensitive to single-word removals. These properties make common token-level evaluation metrics (e.g., decision-flip and rank-del) unreliable for such models. The authors recommend caution using token-level faithfulness metrics and call

Problem Statement

Can modern instruction-tuned LLMs (ChatGPT) generate faithful, usable feature-attribution explanations (word importance) for sentiment analysis, and how do those self-explanations compare to classical methods like occlusion and LIME?
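For readers unfamiliar with the occlusion baseline named above, the following is a minimal sketch of the general idea: a word's saliency is the change in the model's score when that word is removed. The `toy_positive_score` function and its lexicon are hypothetical stand-ins for a real classifier and are not taken from the paper.

```python
from typing import Callable, List

# Hypothetical stand-in for a sentiment model; not the model used in the paper.
LEXICON = {"great": 0.3, "good": 0.2, "boring": -0.3, "bad": -0.2}


def toy_positive_score(words: List[str]) -> float:
    """Return a toy P(positive) in [0, 1] from a tiny hand-made lexicon."""
    raw = 0.5 + sum(LEXICON.get(w.lower(), 0.0) for w in words)
    return min(1.0, max(0.0, raw))


def occlusion_saliency(
    words: List[str], predict: Callable[[List[str]], float]
) -> List[float]:
    """Saliency of word i = score on the full input minus score with word i removed."""
    base = predict(words)
    return [base - predict(words[:i] + words[i + 1:]) for i in range(len(words))]


words = "a great but boring movie".split()
saliency = occlusion_saliency(words, toy_positive_score)
for word, value in zip(words, saliency):
    print(f"{word:>7s} {value:+.2f}")
```

Occlusion needs one extra model call per word, which is part of why a single self-explanation call can be cheaper.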

Main Contribution

A systematic pipeline to elicit ChatGPT self-explanations in two orders: explain-then-predict (E-P) and predict-then-explain (P-E).

Empirical comparison of ChatGPT self-explanations to occlusion and LIME on five faithfulness metrics (comprehensiveness, sufficiency, DFMIT, DFFrac, RankDel) using SST sentences.
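As an illustration of two of the metrics listed above, here is a minimal sketch of comprehensiveness and sufficiency under their commonly used deletion-based definitions (score drop after removing the top-k words, and score drop when keeping only the top-k words). The paper's exact aggregation over k and over sentences may differ; the toy model, the attribution values, and all function names are assumptions made for this example.

```python
from typing import Callable, List

# Hypothetical stand-in for a sentiment model; not the model used in the paper.
LEXICON = {"great": 0.3, "good": 0.2, "boring": -0.3, "bad": -0.2}


def toy_positive_score(words: List[str]) -> float:
    """Return a toy P(positive) in [0, 1] from a tiny hand-made lexicon."""
    raw = 0.5 + sum(LEXICON.get(w.lower(), 0.0) for w in words)
    return min(1.0, max(0.0, raw))


def top_k_indices(saliency: List[float], k: int) -> List[int]:
    """Indices of the k highest-saliency words (ties broken by position)."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return order[:k]


def comprehensiveness(words, saliency, predict: Callable, k: int) -> float:
    """Score drop after removing the k most salient words (higher is better)."""
    top = set(top_k_indices(saliency, k))
    reduced = [w for i, w in enumerate(words) if i not in top]
    return predict(words) - predict(reduced)


def sufficiency(words, saliency, predict: Callable, k: int) -> float:
    """Score drop when keeping only the k most salient words (lower is better)."""
    top = set(top_k_indices(saliency, k))
    kept = [w for i, w in enumerate(words) if i in top]
    return predict(words) - predict(kept)


words = "a great and good movie".split()
attributions = [0.0, 1.0, 0.0, 0.5, 0.0]  # hypothetical self-explanation scores
comp = comprehensiveness(words, attributions, toy_positive_score, k=1)
suff = sufficiency(words, attributions, toy_positive_score, k=1)
print(f"comprehensiveness={comp:.2f} sufficiency={suff:.2f}")
```

The same two functions accept attributions from any source (self-explanation, occlusion, or LIME), which is what makes a side-by-side comparison possible.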

Key Findings

Self-explanations score similarly to LIME and occlusion on faithfulness metrics.

Numbers: E-P comprehensiveness: SELFEXP 0.19 vs LIME 0.17 (Table VII)

Practical Use: You can use ChatGPT-generated word attributions as a low-cost alternative to LIME for quick checks, but verify with other analyses before trusting fine-grained claims.

Evidence Ref: Table VII

Asking the model to produce explanations reduces classification accuracy.

Numbers: Prediction-only 92% vs E-P 85% and P-E 88% (Table VI)

Practical Use: If prediction accuracy is critical, avoid forcing per-token attributions during inference; run explanations separately or use predict-only calls.

Evidence Ref: Table VI

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Prediction-only 92%; E-P 85%; P-E 88%; E-P top-k 80%; P-E top-k 83%	Prediction-only 92%	E-P -7pp; P-E -4pp	SST test subset (n=100)	Table VI	Table VI
comprehensiveness (E-P)	SELFEXP 0.19; LIME 0.17; Occlusion 0.15	Occlusion 0.15	SELFEXP +0.04 vs Occlusion	SST	Table VII	Table VII

What To Try In 7 Days

Run ChatGPT self-explanations on 100 representative examples and compare with occlusion/LIME to spot consistent differences.

If latency/cost matters, use self-explanations for dashboards and reserve LIME for spot checks.

Avoid using token-deletion faithfulness scores as sole evidence; sample human reviews for understandability.
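One simple way to spot consistent differences between two explainers is to measure how much their top-k words overlap. The sketch below uses Jaccard overlap of top-k index sets; this measure is an illustrative choice and not a metric from the paper, and the attribution values are made up.

```python
from typing import List, Set


def top_k(saliency: List[float], k: int) -> Set[int]:
    """Set of indices of the k highest-saliency words (ties broken by position)."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return set(order[:k])


def top_k_jaccard(a: List[float], b: List[float], k: int) -> float:
    """Jaccard overlap of the two top-k sets; assumes k >= 1 and non-empty inputs."""
    set_a, set_b = top_k(a, k), top_k(b, k)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical attributions for the same five-word sentence.
self_exp = [0.5, 1.0, 0.0, 0.0, 0.0]
occlusion = [0.0, 0.3, 0.0, 0.2, 0.0]
agreement = top_k_jaccard(self_exp, occlusion, k=2)
print(f"top-2 Jaccard agreement: {agreement:.2f}")
```

Averaging this score over a sample of sentences gives a rough signal of where the explainers disagree and which examples deserve human review.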

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Stanford Sentiment Treebank (SST)

Risks & Boundaries

Limitations

Experiments run only on OpenAI ChatGPT and a 100-sentence subset of SST; results may not generalize to other LLMs or tasks.

Evaluations rely on automatic token-deletion metrics that the authors show are unreliable when model outputs are rounded.

When Not To Use

Do not use ChatGPT self-explanations as sole evidence in high-stakes decisions (legal, medical, compliance).

Avoid trusting single-word importance claims for long sentences or nuanced reasoning.

Failure Modes

Rounded attribution levels produce low-resolution importance maps that miss subtle influences.

Prediction outputs are often insensitive to single-word deletions, breaking deletion-based faithfulness tests.
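Both failure modes can be checked with simple diagnostics before relying on deletion-based scores. The sketch below counts distinct attribution levels and measures how often removing a single word changes the reported score. The `toy_rounded_score` model, which only ever reports three values, is a hypothetical illustration of rounded, confident outputs and is not the paper's setup.

```python
from typing import Callable, List

# Hypothetical lexicon for a toy model; not taken from the paper.
LEXICON = {"great": 0.3, "good": 0.2, "boring": -0.3, "bad": -0.2}


def toy_rounded_score(words: List[str]) -> float:
    """Toy model that only ever reports 0.1, 0.5, or 0.9 for P(positive)."""
    raw = 0.5 + sum(LEXICON.get(w.lower(), 0.0) for w in words)
    if raw > 0.5:
        return 0.9
    if raw < 0.5:
        return 0.1
    return 0.5


def distinct_levels(saliency: List[float]) -> int:
    """Number of distinct attribution values; few levels means a coarse map."""
    return len(set(saliency))


def single_deletion_sensitivity(
    words: List[str], predict: Callable[[List[str]], float]
) -> float:
    """Fraction of single-word deletions that change the reported score."""
    base = predict(words)
    changed = sum(
        1 for i in range(len(words)) if predict(words[:i] + words[i + 1:]) != base
    )
    return changed / len(words)


words = "a great and good movie".split()
attributions = [0.0, 1.0, 0.0, 0.5, 0.0]  # hypothetical self-explanation scores
levels = distinct_levels(attributions)
sensitivity = single_deletion_sensitivity(words, toy_rounded_score)
print(f"distinct levels: {levels}, deletion sensitivity: {sensitivity:.2f}")
```

In this toy sentence no single deletion moves the reported score, so any metric built on single-word deletions would assign every explanation the same value regardless of its quality.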

Core Entities

Models

ChatGPT

Metrics

comprehensiveness, sufficiency, decision flip rate (DFMIT), DFFrac, RankDel

Datasets

Stanford Sentiment Treebank (SST)

Context Entities

Models

BERT, RoBERTa, GPT-1, GPT-2

Metrics

comprehensiveness, sufficiency

Datasets

Stanford Sentiment Treebank (SST)
