SEA-CoT: pick self-entailment aligned chain-of-thoughts to make explanations more faithful, robust and useful

February 19, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

5

Authors

Wei Jie Yeo, Ranjan Satapathy, Rick Siow Mong Goh, Erik Cambria

Why It Matters For Business

Better explanation selection improves trust and makes model outputs more useful for training smaller systems and auditing model decisions.

Summary TLDR

The paper evaluates how well Chain-of-Thought (CoT) and related prompting methods produce usable, faithful, and robust reasoning explanations from one LLM. It introduces Self-Entailment-Alignment CoT (SEA-CoT): generate multiple reasoning traces, then pick the one that best entails the question+answer and overlaps key tokens. On three commonsense datasets SEA-CoT improves aggregate interpretability vs several baselines, raises simulatability for student models, and reduces counterfactual unfaithfulness. Code is provided.

Problem Statement

Researchers often judge generated CoT explanations only by faithfulness. That misses other practical traits like robustness to wording changes and utility for teaching smaller models. We need a broad, actionable evaluation and a simple method to pick more interpretable explanations from sampled chains.

Main Contribution

Define a three-part interpretability evaluation: faithfulness, robustness, and utility (simulatability).

Propose SEA-CoT: rank sampled CoT traces by entailment to (question+answer) and token overlap, selecting the most aligned trace.

Evaluate multiple prompting styles (CoT, Self-Consistency (SC-CoT), Question Decomposition (QD), Self-Refine (SR)) across three commonsense datasets using a quantized Llama-2 70B model.
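
The selection step in SEA-CoT can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the tiny stopword list, the additive combination of the two scores, and the `overlap_weight` parameter are all assumptions made for this example. The entailment scorer is passed in as a callable; in the paper's setting it would be backed by the LLM itself, whereas here a constant placeholder stands in for it.

```python
import re
from typing import Callable, List, Tuple

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or",
             "what", "which", "do", "does"}


def key_tokens(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def token_overlap(trace: str, target: str) -> float:
    """Fraction of the target's key tokens that also appear in the trace."""
    target_tokens = key_tokens(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & key_tokens(trace)) / len(target_tokens)


def rank_traces(
    question: str,
    answer: str,
    traces: List[str],
    entail_score: Callable[[str, str], float],
    overlap_weight: float = 1.0,  # hypothetical knob, not from the paper
) -> List[Tuple[float, str]]:
    """Score each sampled trace against question+answer, best first."""
    target = f"{question} {answer}"
    scored = [
        (entail_score(trace, target)
         + overlap_weight * token_overlap(trace, target), trace)
        for trace in traces
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def placeholder_entail(premise: str, hypothesis: str) -> float:
    """Stand-in scorer; replace with a real entailment probability."""
    return 0.5


question = "What do plants need to make food?"
answer = "sunlight"
traces = [
    "Cats sleep most of the day.",
    "Plants make food through photosynthesis, which needs sunlight.",
]
ranked = rank_traces(question, answer, traces, placeholder_entail)
best_trace = ranked[0][1]
```

With a real entailment scorer plugged in, `best_trace` would be the explanation kept for the final answer; how the two signals are weighted in practice is a design choice this sketch does not settle.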

Key Findings

SEA-CoT wins on aggregate interpretability across prompts and datasets.

Numbers: SEA-CoT >75% aggregate improvement on OBQA vs baselines

Selecting explanations by both entailment and token overlap reduces counterfactual unfaithfulness and raises simulatability.

Numbers: StrategyQA ablation: CF-UF 3.81 vs 6.44 for random selection; Simu 16.97 vs 11.87

Larger models generally produce more interpretable chains, but SEA-CoT helps smaller ones too.

Numbers: 70B Simu 16.97 vs 13B Simu 6.16 (Table 2)

Results

OBQA aggregate interpretability (SEA-CoT vs baselines)

Value: >75% improvement (aggregate) on OBQA

Baseline: other prompting baselines (CoT, SC-CoT, QD, SR)

StrategyQA ablation (SEA-CoT O&E)

Value: Para 1.2, CF-UF 3.81, M 61.24, S 16.97

Baseline: random selection (Para 6.1, CF-UF 6.44, M 62.17, S 11.87)

Model size effect on simulatability

Value: 70B S=16.97; 13B S=6.16; 7B S=15.97

Baseline: 13B

Who Should Care

What To Try In 7 Days

Sample N=10 CoT traces and rerank by entailment+token overlap (SEA-CoT) before choosing an explanation.

Measure simulatability: append explanations to inputs and fine-tune a small student (e.g., T5-base) to test LAS gains.

Run quick robustness checks: paraphrase and insert a small mistake to see if answers flip often.
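
The robustness check above boils down to counting how often the answer changes after a perturbation. A minimal sketch, assuming answers are short strings compared after trimming and lowercasing (the function name and normalization are illustrative choices, not taken from the paper):

```python
from typing import Sequence


def flip_percentage(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Percentage of examples whose answer changes after a perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("answer lists must have the same length")
    if not original:
        return 0.0
    # Compare answers case-insensitively, ignoring surrounding whitespace.
    flips = sum(
        o.strip().lower() != p.strip().lower()
        for o, p in zip(original, perturbed)
    )
    return 100.0 * flips / len(original)
```

The same helper can be run once on answers obtained from paraphrased explanations and once on answers obtained after mistake insertion; how to read each number depends on the metric, so the two results should be reported separately.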

Optimization Features

Model Optimization

  • GPTQ 4-bit post-training quantization to run Llama-2 70B locally

Inference Optimization

  • Use Hugging Face Text Generation Inference (TGI) for faster serving

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments use only Llama-2 family (quantized); results may differ on other architectures.
  • Perturbation generation relied on GPT-3.5/4 for paraphrase/mistake/counterfactual edits, adding potential bias.
  • SEA-CoT requires sampling multiple chains (N) which increases generation cost and latency.
  • Evaluation focuses on commonsense QA; results may not generalize to knowledge-intensive or domain-specific tasks.

When Not To Use

  • When you need grounded factual checks via external retrieval: SEA-CoT ranks internal alignment, not external truth.
  • Where latency or token cost rules out sampling multiple chains per query.
  • If you cannot compute entailment reliably on your model family (small models may mis-score entailment).

Failure Modes

  • SEA-CoT can favor plausible but factually incorrect chains if the model's entailment scorer hallucinates.
  • High task accuracy can reduce sensitivity of mistake-insertion tests, masking unfaithfulness.
  • Larger N improves utility but can hurt robustness metrics; ranking may trade off different interpretability goals.

Core Entities

Models

  • Llama-2 70B
  • Llama-2 13B
  • Llama-2 7B
  • GPT-3.5 (used for perturbations)
  • GPT-4 (used for counterfactual generation)

Metrics

  • Leakage-Adjusted Simulatability (LAS)
  • Paraphrase flip percentage
  • Counterfactual unfaithfulness (CF-UF)
  • Mistake-insertion flip % (M)
  • Simulatability (S)
  • Aggregate normalized interpretability score
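
For readers unfamiliar with LAS, the sketch below follows the way the metric is commonly described: the student's accuracy gain from seeing the explanation, averaged separately over examples where the explanation alone gives away the label ("leaking") and where it does not, then macro-averaged. This summary does not spell out the formula, so treat the definition, the function name, and the argument names as assumptions; the paper's exact setup may differ.

```python
from typing import Dict, List, Sequence


def leakage_adjusted_simulatability(
    model_preds: Sequence[str],     # teacher model's answers (targets to simulate)
    sim_with_expl: Sequence[str],   # student answers given input + explanation
    sim_input_only: Sequence[str],  # student answers given input only
    sim_expl_only: Sequence[str],   # student answers given explanation only
) -> float:
    """Macro-average of the explanation gain over leaking/non-leaking groups."""
    groups: Dict[bool, List[int]] = {True: [], False: []}
    for y, y_xe, y_x, y_e in zip(
        model_preds, sim_with_expl, sim_input_only, sim_expl_only
    ):
        leaked = (y_e == y)  # the explanation alone reveals the answer
        gain = int(y_xe == y) - int(y_x == y)
        groups[leaked].append(gain)
    # Average within each non-empty group, then across groups.
    means = [sum(g) / len(g) for g in groups.values() if g]
    return sum(means) / len(means) if means else 0.0
```

Splitting by leakage keeps explanations that simply restate the answer from dominating the score.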

Datasets

  • OpenBookQA (OBQA)
  • CommonsenseQA (CSQA)
  • QASC
  • StrategyQA

Benchmarks

  • Commonsense reasoning benchmarks