Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
You can compress LLM reasoning into smaller models while teaching them to flag bad reasoning. That lowers inference cost and raises reliability for deployed small models in edge or cost-sensitive settings.
Summary TLDR
This paper shows a practical way to distill two things from a large language model (LLM) into a much smaller model (SLM): (1) the LLM's ability to self-evaluate its chain-of-thought (CoT) reasoning, and (2) multiple diverse CoTs so the SLM learns broader reasoning. Using GPT‑3.5 as teacher and T5 variants as students, the method (5 CoTs + multiple self-evaluations per CoT, multitask training) raises accuracy across math word problems, commonsense QA, and NLI, and reduces some hallucinations compared to vanilla CoT distillation.
Problem Statement
CoT distillation transfers LLM reasoning into small models but also transfers flawed or hallucinated reasoning. Single CoTs miss alternative valid reasoning paths. The paper aims to teach small models both diverse CoTs and the teacher's self-evaluation ability so students learn when reasoning is correct and avoid inheriting errors.
Main Contribution
Distill an LLM's self-evaluation outputs so an SLM learns to judge its own CoTs and correct likely errors.
Distill multiple diverse CoTs plus corresponding self-evaluations to expose SLMs to a broader reasoning space.
Empirically show on three benchmarks that combining multiple CoTs with self-evaluation improves accuracy and reduces some hallucinations versus prior CoT-only distillation.
Key Findings
Combining five CoTs with self-evaluation improves T5-Base accuracy on SVAMP over 1-CoT distillation.
Five CoTs with self-evaluation yield consistent gains across tasks and label types.
Distilled SLMs can learn to mimic the LLM's evaluation decisions with high agreement.
A manual check shows a net reduction in hallucinations and harmful outputs.
More CoTs help up to a point; returns diminish, and performance can drop beyond about 7 CoTs.
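The "high agreement" finding can be checked on your own data with a simple match rate between student and teacher evaluation decisions. The sketch below is a minimal illustration, not the paper's evaluation code: the `agreement_rate` helper, the "correct"/"incorrect" decision strings, and the sample lists are all assumptions made for the example.

```python
from typing import List


def agreement_rate(student: List[str], teacher: List[str]) -> float:
    """Fraction of examples where the student's decision matches the teacher's."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher lists must have the same length")
    if not student:
        raise ValueError("need at least one decision to compare")
    # Compare case-insensitively and ignore surrounding whitespace.
    matches = sum(
        s.strip().lower() == t.strip().lower() for s, t in zip(student, teacher)
    )
    return matches / len(student)


# Made-up decisions for four CoTs (not results from the paper).
teacher_decisions = ["correct", "incorrect", "correct", "correct"]
student_decisions = ["correct", "incorrect", "incorrect", "correct"]
rate = agreement_rate(student_decisions, teacher_decisions)
print(f"agreement: {rate:.2f}")
```

Exact string matching is the simplest choice; if the student produces free-form evaluations, the decision would first need to be parsed out of the generated text.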
Results
Accuracy
performance gain vs 1 CoT (model scale)
self-evaluation agreement
Who Should Care
What To Try In 7 Days
Generate 5 CoTs and 5 self-evaluations per example from a teacher LLM (e.g., GPT‑3.5).
Train a small T5 model with a multitask prefix setup: 'predict:' for labels and 'explain:' for rationales.
Use λ≈0.5 to weight label vs rationale loss (start there, then tune).
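The steps above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the paper's implementation: the `TeacherRecord` fields, the `build_examples` and `combined_loss` helpers, and the sample question are hypothetical, and the convex form λ·label + (1−λ)·rationale is one common way to apply a single weight. The summary itself only gives the 'predict:' and 'explain:' prefixes and λ≈0.5. Self-evaluation targets are left out because their exact format is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TeacherRecord:
    """One training question with teacher outputs (field names are hypothetical)."""
    question: str
    label: str
    rationales: List[str]  # teacher CoTs for this question, e.g. 5 of them


def build_examples(record: TeacherRecord) -> List[Tuple[str, str]]:
    """Return (input, target) pairs for the two prefixed tasks."""
    pairs = []
    for cot in record.rationales:
        # 'predict:' asks for the label, 'explain:' asks for the rationale.
        pairs.append((f"predict: {record.question}", record.label))
        pairs.append((f"explain: {record.question}", cot))
    return pairs


def combined_loss(label_loss: float, rationale_loss: float, lam: float = 0.5) -> float:
    """Weight label loss against rationale loss (assumed convex combination)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * label_loss + (1.0 - lam) * rationale_loss


# Illustrative record; the question and rationales are made up.
record = TeacherRecord(
    question="Tom has 3 apples and buys 4 more. How many apples does he have?",
    label="7",
    rationales=["3 + 4 = 7, so Tom has 7 apples.", "Start with 3, add 4 to get 7."],
)
examples = build_examples(record)
loss = combined_loss(label_loss=0.8, rationale_loss=1.2)
```

In a real training loop the two loss values would come from the student's forward passes on the 'predict:' and 'explain:' batches; the scalar version here only shows how the weight is applied.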
Optimization Features
Model Optimization
- distillation
Training Optimization
- multi-task learning
- pseudo-labeling
- few-shot prompting
Reproducibility
Data Urls
- SVAMP
- CommonsenseQA
- ANLI
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use only GPT‑3.5 as teacher; results may vary with other LLMs.
- Benchmarks limited to three tasks; generality to other task types unproven.
- Self-evaluation quality is bounded by the teacher’s own introspection and may propagate teacher biases.
When Not To Use
- When you cannot query a reliable teacher LLM for multiple CoTs and self-evaluations.
- When training cost or labeling budget forbids generating many teacher outputs.
- For tasks where reasoning diversity adds noise rather than signal (monitor validation).
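To judge whether the teacher budget is a blocker, a back-of-the-envelope count of teacher generations can help. The sketch below is an assumption-laden estimate, not a figure from the paper: the `teacher_generations` helper is hypothetical, the dataset size of 1,000 is a placeholder, and it assumes each CoT and each self-evaluation is one separate generation.

```python
def teacher_generations(n_examples: int, cots_per_example: int, evals_per_cot: int) -> int:
    """Count teacher outputs: every CoT plus every self-evaluation of that CoT."""
    if min(n_examples, cots_per_example, evals_per_cot) < 0:
        raise ValueError("counts must be non-negative")
    return n_examples * cots_per_example * (1 + evals_per_cot)


# Placeholder dataset size; 5 CoTs with one self-evaluation each gives
# 5 CoTs and 5 self-evaluations per example.
total = teacher_generations(n_examples=1000, cots_per_example=5, evals_per_cot=1)
print(total)
```

Multiplying the result by an average token count and the teacher's per-token price gives a rough cost figure; both of those inputs depend on your provider and prompts.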
Failure Modes
- Teacher self-evaluations are incorrect or biased, causing SLM to learn wrong checks.
- Too many CoTs introduce conflicting reasoning and harm student learning (observed beyond about 7 CoTs).
- Low-quality pseudo-labels reduce benefit; human labels still outperform pseudo-labels.
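Because too many CoTs can hurt, the CoT count is worth treating as a hyperparameter chosen on validation data. The sketch below shows one simple selection rule; the `pick_num_cots` helper is hypothetical and the accuracy values are placeholders, not results from the paper.

```python
from typing import Dict


def pick_num_cots(val_accuracy: Dict[int, float]) -> int:
    """Return the CoT count with the best validation accuracy.

    Ties go to the smaller count, since fewer CoTs means fewer teacher calls.
    """
    if not val_accuracy:
        raise ValueError("need at least one (num_cots, accuracy) entry")
    best = max(val_accuracy.values())
    return min(k for k, acc in val_accuracy.items() if acc == best)


# Placeholder validation accuracies keyed by number of CoTs per example.
sweep = {1: 0.60, 3: 0.64, 5: 0.66, 7: 0.66, 10: 0.63}
chosen = pick_num_cots(sweep)
print(chosen)
```

Each entry in the sweep requires training one student, so in practice a short list of counts (for example 1, 3, 5, 7) keeps the search affordable.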
Core Entities
Models
- gpt-3.5-turbo
- T5-Base (220M)
- T5-Small (60M)
- T5-Large (770M)
Metrics
- Accuracy
Datasets
- SVAMP (math word problems)
- CommonsenseQA (CQA)
- ANLI (natural language inference)

