SEA-CoT: pick self-entailment aligned chain-of-thoughts to make explanations more faithful, robust and useful

February 19, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

Method is simple and practical: rank sampled chains by self-computed entailment and overlap. Results are reproducible but were run on a single LLM family and use external APIs for perturbations.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Wei Jie Yeo, Ranjan Satapathy, Rick Siow Mong Goh, Erik Cambria

Links

Abstract / PDF / Code

Why It Matters For Business

Better explanation selection improves trust and makes model outputs more useful for training smaller systems and auditing model decisions.

Who Should Care

Summary TLDR

The paper evaluates how well Chain-of-Thought (CoT) and related prompting methods produce usable, faithful, and robust reasoning explanations from a single LLM family. It introduces Self-Entailment-Alignment CoT (SEA-CoT): generate multiple reasoning traces, then pick the one that best entails the question plus answer and overlaps with their key tokens. On three commonsense datasets, SEA-CoT improves aggregate interpretability over several baselines, raises simulatability for student models, and reduces counterfactual unfaithfulness. Code is provided.

Problem Statement

Researchers often judge generated CoT explanations only by faithfulness. That misses other practical traits like robustness to wording changes and utility for teaching smaller models. We need a broad, actionable evaluation and a simple method to pick more interpretable explanations from sampled chains.

Main Contribution

Define a three-part interpretability evaluation: faithfulness, robustness, and utility (simulatability).

Propose SEA-CoT: rank sampled CoT traces by entailment to (question+answer) and token overlap, selecting the most aligned trace.
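The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `entail_fn` is a hypothetical placeholder for the LLM's self-computed entailment score, the tokenizer is a simple regex, and combining the two signals with a weighted sum (`weight`) is an assumption, since this summary does not state how the paper combines them.

```python
import re
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(chain: str, target: str) -> float:
    """Fraction of target (question + answer) tokens that also appear in the chain."""
    target_tokens = tokenize(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & tokenize(chain)) / len(target_tokens)


def rank_chains(
    question: str,
    answer: str,
    chains: List[str],
    entail_fn: Callable[[str, str], float],  # hypothetical: returns a score in [0, 1]
    weight: float = 0.5,  # assumed mixing weight between entailment and overlap
) -> List[Tuple[float, str]]:
    """Score each sampled chain against question + answer, best first."""
    target = f"{question} {answer}"
    scored = [
        (weight * entail_fn(chain, target)
         + (1.0 - weight) * overlap_score(chain, target), chain)
        for chain in chains
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In practice `entail_fn` would prompt the same model to judge whether a chain entails the question and answer; the first element of the returned list is the chain to keep.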

Key Findings

SEA-CoT wins on aggregate interpretability across prompts and datasets.

Numbers: SEA-CoT >75% aggregate improvement on OBQA vs. baselines

Practical Use: If you sample multiple CoT traces, rank them by entailment plus overlap (SEA-CoT) to get more interpretable explanations in practice.

Evidence Ref: Section 6.1, Figure 5

Selecting explanations by both entailment and token overlap reduces counterfactual unfaithfulness and raises simulatability.

Numbers: StrategyQA ablation: CF-UF 3.81 vs. 6.44 for random selection; simulatability 16.97 vs. 11.87

Practical Use: Use SEA-CoT's O&E ranking rather than random or max-probability selection to make explanations less likely to ignore input edits and more useful for training small models.

Evidence Ref: Table 1 (Ablation)

Results

Metric: OBQA aggregate interpretability (SEA-CoT vs. baselines)
Value: >75% improvement (aggregate) on OBQA
Baseline: other prompting baselines (CoT, SC-CoT, QD, SR)
Delta: >75%
Split / Dataset: OBQA test
Evidence: Section 6.1, Figure 5
Evidence Ref: Figure 5

Metric: StrategyQA ablation (SEA-CoT O&E)
Value: Para 1.2, CF-UF 3.81, M 61.24, S 16.97
Baseline: Random selection: Para 6.1, CF-UF 6.44, M 62.17, S 11.87
Delta: CF-UF -2.63 (41% relative reduction); S +5.1
Split / Dataset: StrategyQA test
Evidence: Table 1 ablation results
Evidence Ref: Table 1
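The deltas reported for the StrategyQA ablation can be re-derived from the raw values in the row above. The short check below uses only those reported numbers; the helper name `deltas` is illustrative.

```python
from typing import Tuple


def deltas(method: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute change, relative change) of method against baseline."""
    absolute = method - baseline
    relative = absolute / baseline
    return absolute, relative


# Counterfactual unfaithfulness (lower is better): SEA-CoT O&E vs. random selection.
cf_abs, cf_rel = deltas(3.81, 6.44)
# Simulatability (higher is better): SEA-CoT O&E vs. random selection.
simu_abs, _ = deltas(16.97, 11.87)

print(f"CF-UF: {cf_abs:+.2f} ({cf_rel:+.0%} relative)")
print(f"Simulatability: {simu_abs:+.2f}")
```

The relative change for CF-UF is 2.63 / 6.44, which rounds to the 41% reduction quoted in the row.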

What To Try In 7 Days

Sample N=10 CoT traces and rerank by entailment+token overlap (SEA-CoT) before choosing an explanation.

Measure simulatability: append explanations to inputs and fine-tune a small student (e.g., T5-base) to test LAS gains.

Run quick robustness checks: paraphrase and insert a small mistake to see if answers flip often.
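For the robustness check in the last step, a flip percentage is enough to get started. The sketch below assumes you already have the model's answers before and after a perturbation (paraphrase or inserted mistake) as two aligned lists; the function name and the case-insensitive comparison are illustrative choices, not the paper's exact protocol.

```python
from typing import Sequence


def flip_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Percentage of examples whose answer changed after the perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("answer lists must be aligned and of equal length")
    if not original:
        return 0.0
    flips = sum(
        before.strip().lower() != after.strip().lower()
        for before, after in zip(original, perturbed)
    )
    return 100.0 * flips / len(original)


# Toy example: two of four answers change after the perturbation.
rate = flip_rate(["yes", "no", "yes", "yes"], ["yes", "yes", "yes", "no"])
print(f"flip rate: {rate:.1f}%")
```

How to read the number depends on the perturbation: a paraphrase should ideally leave the answer unchanged, whereas a mistake inserted into the reasoning probes whether the answer actually depends on that reasoning.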

Optimization Features

Model Optimization
GPTQ 4-bit post-training quantization to run Llama-2 70B locally
Inference Optimization
Use Hugging Face text-generation-inference for faster serving
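As a rough sketch of how sampled chains might be drawn from a text-generation-inference server, the snippet below posts to its `/generate` endpoint using only the standard library. The URL, the temperature, the token budget, and the helper names are assumptions for illustration; the paper's actual serving and sampling settings are not given in this summary.

```python
import json
import urllib.request
from typing import List


def build_payload(prompt: str, temperature: float = 0.7,
                  max_new_tokens: int = 256) -> dict:
    """Build a /generate request body that enables sampling."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "temperature": temperature,        # assumed value
            "max_new_tokens": max_new_tokens,  # assumed value
        },
    }


def sample_chains(prompt: str, n: int = 10,
                  url: str = "http://localhost:8080/generate") -> List[str]:
    """Request n independently sampled chains (url is a hypothetical local server)."""
    chains = []
    for _ in range(n):
        request = urllib.request.Request(
            url,
            data=json.dumps(build_payload(prompt)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            chains.append(json.loads(response.read())["generated_text"])
    return chains
```

Calling `sample_chains(prompt, n=10)` against a running server would return ten candidate explanations that can then be reranked.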

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Experiments use only the Llama-2 family (quantized); results may differ on other architectures.

Perturbation generation relied on GPT-3.5/4 for paraphrase/mistake/counterfactual edits, adding potential bias.

When Not To Use

When you need grounded factual checks via external retrieval: SEA-CoT ranks internal alignment, not external truth.

Where latency or token cost forbids sampling dozens of chains.

Failure Modes

SEA-CoT can favor plausible but factually incorrect chains if the model's entailment scorer hallucinates.

High task accuracy can reduce sensitivity of mistake-insertion tests, masking unfaithfulness.

Core Entities

Models

Llama-2 70B
Llama-2 13B
Llama-2 7B
GPT-3.5 (used for perturbations)
GPT-4 (used for counterfactual generation)

Metrics

Leakage-Adjusted Simulatability (LAS)
Paraphrase flip percentage
Counterfactual unfaithfulness (CF-UF)
Mistake-insertion flip % (M)
Simulatability (S)
Aggregate normalized interpretability score

Datasets

OpenBookQA (OBQA)
CommonsenseQA (CSQA)
QASC
StrategyQA

Benchmarks

Commonsense reasoning benchmarks