Overview
The method is simple and practical: it ranks sampled chains by self-computed entailment and token overlap. Results are reproducible, but the experiments were run on a single LLM family and rely on external APIs to generate perturbations.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Better explanation selection improves trust and makes model outputs more useful for training smaller systems and auditing model decisions.
Summary TLDR
The paper evaluates how well Chain-of-Thought (CoT) and related prompting methods produce usable, faithful, and robust reasoning explanations from a single LLM family. It introduces Self-Entailment-Alignment CoT (SEA-CoT): generate multiple reasoning traces, then pick the one that best entails the question plus answer and overlaps with their key tokens. On three commonsense datasets, SEA-CoT improves aggregate interpretability over several baselines, raises simulatability for student models, and reduces counterfactual unfaithfulness. Code is provided.
Problem Statement
Researchers often judge generated CoT explanations only by faithfulness. That misses other practical traits like robustness to wording changes and utility for teaching smaller models. We need a broad, actionable evaluation and a simple method to pick more interpretable explanations from sampled chains.
Main Contribution
Define a three-part interpretability evaluation: faithfulness, robustness, and utility (simulatability).
Propose SEA-CoT: rank sampled CoT traces by entailment to (question+answer) and token overlap, selecting the most aligned trace.
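The selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal-weight sum of the two scores, the stopword list, and the names `key_tokens`, `token_overlap`, and `select_trace` are assumptions made here, and `entail_fn` stands in for whatever entailment scorer is used (in the paper, the model scores entailment itself).

```python
import re
from typing import Callable, Sequence, Set, Tuple

# Hypothetical stopword list; the summary does not say how key tokens are chosen.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "it", "so"}


def key_tokens(text: str) -> Set[str]:
    """Lowercased word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def token_overlap(trace: str, target: str) -> float:
    """Fraction of the target's key tokens that also appear in the trace."""
    target_tokens = key_tokens(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & key_tokens(trace)) / len(target_tokens)


def select_trace(
    traces: Sequence[str],
    question: str,
    answer: str,
    entail_fn: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the trace with the highest entailment + overlap score.

    entail_fn(premise, hypothesis) is assumed to return a score in [0, 1].
    Summing the two scores with equal weight is an assumption of this sketch.
    """
    if not traces:
        raise ValueError("need at least one sampled trace")
    target = f"{question} {answer}"
    scored = [(entail_fn(t, target) + token_overlap(t, target), t) for t in traces]
    best_score, best_trace = max(scored, key=lambda pair: pair[0])
    return best_trace, best_score
```

Passing the scorer in as a callable keeps the ranking logic independent of how entailment is computed, so the same code works with a self-scoring LLM prompt or a separate NLI model.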
Key Findings
SEA-CoT wins on aggregate interpretability across prompts and datasets.
Selecting explanations by both entailment and token overlap reduces counterfactual unfaithfulness and raises simulatability.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| OBQA aggregate interpretability (SEA-CoT vs baselines) | >75% improvement (aggregate) on OBQA | other prompting baselines (CoT, SC-CoT, QD, SR) | >75% | OBQA test | Section 6.1, Figure 5 | Figure 5 |
| StrategyQA ablation (SEA-CoT O&E) | Para 1.2, CF-UF 3.81, M 61.24, S 16.97 | Random selection: Para 6.1, CF-UF 6.44, M 62.17, S 11.87 | CF-UF -2.63 (41% relative reduction); S +5.1 | StrategyQA test | Table 1 ablation results | Table 1 |
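
The deltas in the StrategyQA row can be re-derived from the reported values. The short check below only repeats that arithmetic; the variable names are illustrative.

```python
# Values copied from the StrategyQA ablation row (Table 1).
random_cf_uf, sea_cf_uf = 6.44, 3.81   # counterfactual unfaithfulness (CF-UF)
random_s, sea_s = 11.87, 16.97         # S column

cf_uf_delta = sea_cf_uf - random_cf_uf   # absolute change in CF-UF
cf_uf_rel = cf_uf_delta / random_cf_uf   # change relative to random selection
s_delta = sea_s - random_s               # absolute change in S

print(f"CF-UF delta: {cf_uf_delta:.2f} ({cf_uf_rel:.0%} relative)")
print(f"S delta: {s_delta:+.2f}")
```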
What To Try In 7 Days
Sample N=10 CoT traces and rerank by entailment+token overlap (SEA-CoT) before choosing an explanation.
Measure simulatability: append explanations to inputs and fine-tune a small student (e.g., T5-base) to test gains in LAS (leakage-adjusted simulatability).
Run quick robustness checks: paraphrase and insert a small mistake to see if answers flip often.
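The robustness check in the last item amounts to counting how often the answer changes under perturbation. A minimal sketch, assuming a hypothetical `answer_fn` that wraps the model call and returns a short answer string, and assuming the perturbed questions have already been produced (the paper generated them with GPT-3.5/4):

```python
from typing import Callable, Sequence


def flip_rate(
    answer_fn: Callable[[str], str],
    questions: Sequence[str],
    perturbed: Sequence[str],
) -> float:
    """Fraction of questions whose answer changes after perturbation.

    answer_fn is a caller-supplied wrapper around the model (hypothetical here).
    Answers are compared after stripping whitespace and lowercasing.
    """
    if len(questions) != len(perturbed):
        raise ValueError("questions and perturbed must have the same length")
    if not questions:
        return 0.0
    flips = sum(
        answer_fn(q).strip().lower() != answer_fn(p).strip().lower()
        for q, p in zip(questions, perturbed)
    )
    return flips / len(questions)
```

How to read the number depends on the perturbation: for paraphrases, a low flip rate suggests robustness to wording, while mistake-insertion tests need separate interpretation.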
Risks & Boundaries
Limitations
Experiments use only Llama-2 family (quantized); results may differ on other architectures.
Perturbation generation relied on GPT-3.5/4 for paraphrase/mistake/counterfactual edits, adding potential bias.
When Not To Use
When you need grounded factual checks via external retrieval — SEA-CoT ranks internal alignment, not external truth.
Where latency or token cost rules out sampling multiple chains per query (e.g., N=10).
Failure Modes
SEA-CoT can favor plausible but factually incorrect chains if the model's entailment scorer hallucinates.
High task accuracy can reduce sensitivity of mistake-insertion tests, masking unfaithfulness.

