SEA-CoT: pick self-entailment aligned chain-of-thoughts to make explanations more faithful, robust and useful

Overview

Decision Snapshot: Ready For Pilot

Method is simple and practical: rank sampled chains by self-computed entailment and overlap. Results are reproducible but were run on a single LLM family and use external APIs for perturbations.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Wei Jie Yeo, Ranjan Satapathy, Rick Siow Mong Goh, Erik Cambria

Links

Abstract / PDF / Code

Why It Matters For Business

Better explanation selection improves trust and makes model outputs more useful for training smaller systems and auditing model decisions.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The paper evaluates how well Chain-of-Thought (CoT) and related prompting methods produce usable, faithful, and robust reasoning explanations from one LLM. It introduces Self-Entailment-Alignment CoT (SEA-CoT): generate multiple reasoning traces, then pick the one that best entails the question+answer and overlaps key tokens. On three commonsense datasets SEA-CoT improves aggregate interpretability vs several baselines, raises simulatability for student models, and reduces counterfactual unfaithfulness. Code is provided.

Problem Statement

Researchers often judge generated CoT explanations only by faithfulness. That misses other practical traits like robustness to wording changes and utility for teaching smaller models. We need a broad, actionable evaluation and a simple method to pick more interpretable explanations from sampled chains.

Main Contribution

Define a three-part interpretability evaluation: faithfulness, robustness, and utility (simulatability).

Propose SEA-CoT: rank sampled CoT traces by entailment to (question+answer) and token overlap, selecting the most aligned trace.
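The ranking step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `entail_score` is a hypothetical callable standing in for the model's self-computed entailment score, the overlap measure is a simple token-set fraction, and adding the two scores together is an assumed combination rule (this summary does not specify the exact one).

```python
import re
from typing import Callable, List, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(trace: str, target: str) -> float:
    """Fraction of target tokens that also appear in the trace."""
    target_tokens = tokenize(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & tokenize(trace)) / len(target_tokens)


def rank_traces(
    traces: List[str],
    question: str,
    answer: str,
    entail_score: Callable[[str, str], float],  # hypothetical scorer: (premise, hypothesis) -> score
) -> List[Tuple[float, str]]:
    """Sort sampled traces by entailment + overlap against question+answer, best first."""
    target = f"{question} {answer}"
    scored = [
        # Assumption: the two signals are combined by simple addition.
        (entail_score(trace, target) + token_overlap(trace, target), trace)
        for trace in traces
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy usage with a stand-in scorer; a real setup would query the LLM or an NLI model.
def toy_entail(premise: str, hypothesis: str) -> float:
    return 0.9 if "cannot fly" in premise else 0.1


sampled = [
    "Fish swim in the sea.",
    "Penguins are birds that cannot fly, so the answer is no.",
]
ranked = rank_traces(sampled, "Can a penguin fly?", "no", toy_entail)
best_trace = ranked[0][1]
```

Passing the scorer in as a callable keeps the sketch independent of any particular model API; only the selection logic is shown.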

Key Findings

SEA-CoT wins on aggregate interpretability across prompts and datasets.

Numbers: SEA-CoT >75% aggregate improvement on OBQA vs baselines

Practical Use: If you sample multiple CoT traces, rank them by entailment+overlap (SEA-CoT) to get more interpretable explanations in practice.

Evidence Ref: Section 6.1, Figure 5

Selecting explanations by both entailment and token overlap reduces counterfactual unfaithfulness and raises simulatability.

Numbers: StrategyQA ablation: CF-UF 3.81 vs 6.44 for random selection; simulatability 16.97 vs 11.87

Practical Use: Use SEA-CoT's O&E (overlap and entailment) ranking rather than random or max-probability selection to make explanations less likely to ignore input edits and more useful for training small models.

Evidence Ref: Table 1 (Ablation)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
OBQA aggregate interpretability (SEA-CoT vs baselines)	>75% improvement (aggregate) on OBQA	other prompting baselines (CoT, SC-CoT, QD, SR)	>75%	OBQA test	Section 6.1, Figure 5	Figure 5
StrategyQA ablation (SEA-CoT O&E)	Para 1.2, CF-UF 3.81, M 61.24, S 16.97	Random selection: Para 6.1, CF-UF 6.44, M 62.17, S 11.87	CF-UF -2.63 (41% relative reduction); S +5.1	StrategyQA test	Table 1 ablation results	Table 1
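The StrategyQA deltas in the table can be re-derived from the reported values. The snippet below is only an arithmetic check; the helper names are illustrative and the numbers are copied from the row above.

```python
def absolute_delta(new: float, baseline: float) -> float:
    """Difference between the new value and the baseline."""
    return new - baseline


def relative_change(new: float, baseline: float) -> float:
    """Change relative to the baseline, as a fraction (negative means a reduction)."""
    return (new - baseline) / baseline


# Values reported for the StrategyQA ablation (SEA-CoT O&E vs random selection).
sea_cot = {"cf_uf": 3.81, "simulatability": 16.97}
random_pick = {"cf_uf": 6.44, "simulatability": 11.87}

cf_uf_delta = absolute_delta(sea_cot["cf_uf"], random_pick["cf_uf"])
cf_uf_relative = relative_change(sea_cot["cf_uf"], random_pick["cf_uf"])
simu_delta = absolute_delta(sea_cot["simulatability"], random_pick["simulatability"])

print(f"CF-UF delta: {cf_uf_delta:+.2f} ({cf_uf_relative:.0%} relative)")
print(f"Simulatability delta: {simu_delta:+.2f}")
```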

What To Try In 7 Days

Sample N=10 CoT traces and rerank by entailment+token overlap (SEA-CoT) before choosing an explanation.

Measure simulatability: append explanations to inputs and fine-tune a small student (e.g., T5-base) to test LAS gains.

Run quick robustness checks: paraphrase and insert a small mistake to see if answers flip often.
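For the robustness check in the last step, a flip percentage can be computed along these lines. This is a sketch under assumptions: answers are compared as case-insensitive, whitespace-trimmed strings, and generating the paraphrased or mistake-inserted inputs is left outside the snippet.

```python
from typing import Sequence


def flip_rate(original_answers: Sequence[str], perturbed_answers: Sequence[str]) -> float:
    """Percentage of examples whose answer changes after the input is perturbed."""
    if len(original_answers) != len(perturbed_answers):
        raise ValueError("answer lists must have the same length")
    if not original_answers:
        return 0.0
    flips = sum(
        # Assumption: normalize by trimming whitespace and lowercasing before comparing.
        before.strip().lower() != after.strip().lower()
        for before, after in zip(original_answers, perturbed_answers)
    )
    return 100.0 * flips / len(original_answers)


# Toy usage: one of four answers changes after perturbation.
answers_before = ["yes", "no", "yes", "no"]
answers_after = ["Yes", "yes", "yes", "no "]
rate = flip_rate(answers_before, answers_after)
```

How a given flip rate should be read depends on the perturbation: flips under paraphrase suggest brittleness, while the mistake-insertion test looks at whether the answer responds to an edited explanation.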

Optimization Features

Model Optimization

GPTQ 4-bit post-training quantization to run Llama-2 70B locally

Inference Optimization

Use Hugging Face text-generation-inference for faster serving

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/SenticNet/CoT_interpretability

Risks & Boundaries

Limitations

Experiments use only the Llama-2 family (quantized); results may differ on other architectures.

Perturbation generation relied on GPT-3.5/4 for paraphrase/mistake/counterfactual edits, adding potential bias.

When Not To Use

When you need grounded factual checks via external retrieval — SEA-CoT ranks internal alignment, not external truth.

Where latency or token cost rules out sampling multiple chains per question.

Failure Modes

SEA-CoT can favor plausible but factually incorrect chains if the model's entailment scorer hallucinates.

High task accuracy can reduce sensitivity of mistake-insertion tests, masking unfaithfulness.

Core Entities

Models

Llama-2 70B, Llama-2 13B, Llama-2 7B, GPT-3.5 (used for perturbations), GPT-4 (used for counterfactual generation)

Metrics

Leakage-Adjusted Simulatability (LAS), Paraphrase flip percentage, Counterfactual unfaithfulness (CF-UF), Mistake-insertion flip % (M), Simulatability (S), Aggregate normalized interpretability score

Datasets

OpenBookQA (OBQA), CommonsenseQA (CSQA), QASC, StrategyQA

Benchmarks

Commonsense reasoning benchmarks
