Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
5
Why It Matters For Business
Better explanation selection improves trust and makes model outputs more useful for training smaller systems and auditing model decisions.
Summary TLDR
The paper evaluates how well Chain-of-Thought (CoT) and related prompting methods produce usable, faithful, and robust reasoning explanations from a single LLM. It introduces Self-Entailment-Alignment CoT (SEA-CoT): generate multiple reasoning traces, then pick the one that best entails the question plus answer and shares the most key tokens with it. On three commonsense datasets, SEA-CoT improves aggregate interpretability over several baselines, raises simulatability for student models, and reduces counterfactual unfaithfulness. Code is provided.
Problem Statement
Researchers often judge generated CoT explanations only by faithfulness. That misses other practical traits like robustness to wording changes and utility for teaching smaller models. We need a broad, actionable evaluation and a simple method to pick more interpretable explanations from sampled chains.
Main Contribution
Define a three-part interpretability evaluation: faithfulness, robustness, and utility (simulatability).
Propose SEA-CoT: rank sampled CoT traces by entailment to (question+answer) and token overlap, selecting the most aligned trace.
Evaluate multiple prompting styles (CoT, Self-Consistency, QD, Self-Refine) across three commonsense datasets using a quantized Llama-2 70B model.
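The selection step in SEA-CoT can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `entail` callable, the equal `weight` between the two scores, and the regex tokenizer are all hypothetical choices made for the example. In practice the entailment score would come from an NLI model or from the LLM itself.

```python
import re
from typing import Callable, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens (a crude stand-in for 'key tokens')."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(trace: str, target: str) -> float:
    """Fraction of the target's tokens that also appear in the trace."""
    target_tokens = tokens(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & tokens(trace)) / len(target_tokens)


def sea_cot_select(
    traces: List[str],
    question: str,
    answer: str,
    entail: Callable[[str, str], float],  # hypothetical scorer: (premise, hypothesis) -> [0, 1]
    weight: float = 0.5,                  # assumed equal weighting of the two signals
) -> Tuple[float, str]:
    """Return (score, trace) for the trace best aligned with question+answer."""
    target = f"{question} {answer}"
    scored = [
        (weight * entail(trace, target) + (1 - weight) * overlap(trace, target), trace)
        for trace in traces
    ]
    return max(scored, key=lambda pair: pair[0])


# Usage with a placeholder entailment scorer that always returns 0.0,
# so only token overlap separates the candidates.
traces = [
    "Birds have wings, so a sparrow can fly.",
    "Fish live in water.",
]
score, best = sea_cot_select(
    traces, "Can a sparrow fly?", "yes", entail=lambda premise, hypothesis: 0.0
)
```

With a real entailment scorer plugged in, the same function reranks the N sampled traces before one is shown to a user or passed to a student model.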
Key Findings
SEA-CoT wins on aggregate interpretability across prompts and datasets.
Selecting explanations by both entailment and token overlap reduces counterfactual unfaithfulness and raises simulatability.
Larger models generally produce more interpretable chains, but SEA-CoT helps smaller ones too.
Results
OBQA aggregate interpretability (SEA-CoT vs baselines)
StrategyQA ablation (SEA-CoT O&E)
Model size effect on simulatability
Who Should Care
What To Try In 7 Days
Sample N=10 CoT traces and rerank by entailment+token overlap (SEA-CoT) before choosing an explanation.
Measure simulatability: append explanations to inputs and fine-tune a small student (e.g., T5-base) to test LAS gains.
Run quick robustness checks: apply a paraphrase, then a small inserted mistake, and measure how often the answer flips in each case.
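The flip checks above reduce to comparing answers before and after a perturbation. A minimal sketch, assuming answers are stored as plain strings in matching order (the function name and data layout are illustrative, not taken from the paper's code):

```python
from typing import Sequence


def flip_percentage(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Percentage of examples whose answer changed after a perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("answer lists must have the same length")
    if not original:
        return 0.0
    flips = sum(before != after for before, after in zip(original, perturbed))
    return 100.0 * flips / len(original)


# Example: one of four answers changes after the perturbation.
rate = flip_percentage(["A", "B", "C", "D"], ["A", "C", "C", "D"])
```

How a flip rate should be read depends on the perturbation: after a meaning-preserving paraphrase, frequent flips suggest fragility, whereas after an inserted mistake, very few flips may suggest the answer does not depend on the stated reasoning.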
Optimization Features
Model Optimization
- GPTQ 4-bit post-training quantization to run Llama-2 70B locally
Inference Optimization
- Use Hugging Face text-generation-inference for faster serving
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments use only the Llama-2 family (quantized); results may differ on other architectures.
- Perturbation generation relied on GPT-3.5/4 for paraphrase/mistake/counterfactual edits, adding potential bias.
- SEA-CoT requires sampling multiple chains (N), which increases generation cost and latency.
- Evaluation focuses on commonsense QA; results may not generalize to knowledge-intensive or domain-specific tasks.
When Not To Use
- When you need grounded factual checks via external retrieval: SEA-CoT ranks internal alignment, not external truth.
- Where latency or token cost rules out sampling multiple chains per query.
- If you cannot compute entailment reliably on your model family (small models may mis-score entailment).
Failure Modes
- SEA-CoT can favor plausible but factually incorrect chains if the model's entailment scorer hallucinates.
- High task accuracy can reduce sensitivity of mistake-insertion tests, masking unfaithfulness.
- Larger N improves utility but can hurt robustness metrics; ranking may trade off different interpretability goals.
Core Entities
Models
- Llama-2 70B
- Llama-2 13B
- Llama-2 7B
- GPT-3.5 (used for perturbations)
- GPT-4 (used for counterfactual generation)
Metrics
- Leakage-Adjusted Simulatability (LAS)
- Paraphrase flip percentage
- Counterfactual unfaithfulness (CF-UF)
- Mistake-insertion flip % (M)
- Simulatability (S)
- Aggregate normalized interpretability score
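As a rough illustration of the first metric, the sketch below computes Leakage-Adjusted Simulatability under the commonly used formulation: an example counts as "leaking" when the student predicts the target label from the explanation alone, and the score is the macro-average over the leaking and non-leaking groups of the accuracy gain from adding the explanation. The argument names and the handling of an empty group are assumptions for this example and may differ from the paper's evaluation code.

```python
from statistics import mean
from typing import Dict, List, Sequence


def las(
    target: Sequence[str],   # labels the student tries to simulate
    pred_xe: Sequence[str],  # student predictions from input + explanation
    pred_x: Sequence[str],   # student predictions from input only
    pred_e: Sequence[str],   # student predictions from explanation only
) -> float:
    """Macro-average of explanation gains over leaking / non-leaking examples."""
    gains: Dict[bool, List[int]] = {True: [], False: []}
    for y, xe, x, e in zip(target, pred_xe, pred_x, pred_e):
        leaked = e == y  # the explanation alone gives away the label
        gains[leaked].append(int(xe == y) - int(x == y))
    # Assumption: if one group is empty, average over the non-empty group only.
    group_means = [mean(group) for group in gains.values() if group]
    return mean(group_means) if group_means else 0.0


# Toy example: two leaking and two non-leaking examples.
score = las(
    target=["A", "B", "A", "B"],
    pred_xe=["A", "B", "A", "A"],
    pred_x=["A", "A", "B", "B"],
    pred_e=["A", "B", "B", "A"],
)
```

Averaging the two groups separately keeps explanations that simply restate the answer from dominating the score.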
Datasets
- OpenBookQA (OBQA)
- CommonsenseQA (CSQA)
- QASC
- StrategyQA
Benchmarks
- Commonsense reasoning benchmarks

