Teach small models to judge their own chains of thought and learn from multiple reasoning paths

November 15, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Weize Liu, Guocong Li, Kai Zhang, Bang Du, Qiyuan Chen, Xuming Hu, Hongxia Xu, Jintai Chen, Jian Wu

Why It Matters For Business

You can compress LLM reasoning into smaller models while teaching them to flag bad reasoning. That lowers inference cost and raises reliability for deployed small models in edge or cost-sensitive settings.

Summary TLDR

This paper shows a practical way to distill two things from a large language model (LLM) into a much smaller model (SLM): (1) the LLM's ability to self-evaluate its chain-of-thought (CoT) reasoning, and (2) multiple diverse CoTs so the SLM learns broader reasoning. Using GPT‑3.5 as teacher and T5 variants as students, the method (5 CoTs + multiple self-evaluations per CoT, multitask training) raises accuracy across math word problems, commonsense QA, and NLI, and reduces some hallucinations compared to vanilla CoT distillation.

Problem Statement

CoT distillation transfers LLM reasoning into small models but also transfers flawed or hallucinated reasoning. Single CoTs miss alternative valid reasoning paths. The paper aims to teach small models both diverse CoTs and the teacher's self-evaluation ability so students learn when reasoning is correct and avoid inheriting errors.

Main Contribution

Distill an LLM's self-evaluation outputs so an SLM learns to judge its own CoTs and correct likely errors.

Distill multiple diverse CoTs plus corresponding self-evaluations to expose SLMs to a broader reasoning space.

Empirically show on three benchmarks that combining multiple CoTs with self-evaluation improves accuracy and reduces some hallucinations versus prior CoT-only distillation.

Key Findings

Combining five CoTs with self-evaluation improves T5-Base accuracy on SVAMP vs 1-CoT distillation

Numbers: 60.3% vs 51.7% (±0.6 vs ±2.1)

5 CoTs w/ self-evaluation yields consistent gains across tasks and label types

Numbers: Improvements shown vs baselines on SVAMP/CQA/ANLI (Table 1)

Distilled SLMs can learn to mimic LLM evaluation decisions with high agreement

Numbers: ≈90% consistency with GPT-3.5 self-evaluations

Manual check shows net reduction in hallucinations/harmful outputs

Numbers: Significant reduction in ~7% of examined cases; tied 91%; worse <2%

More CoTs help up to a point; returns diminish and can hurt beyond ~7 CoTs

Numbers: Performance drops after >7 CoTs; chosen default = 5 CoTs
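
The ≈90% consistency figure above is an agreement rate between the student's and the teacher's evaluation verdicts. As a rough illustration of how such a rate could be computed, here is a minimal sketch. The function name `evaluation_agreement`, the verdict strings, and the toy data are all assumptions made for this example; the summary does not describe the paper's exact evaluation protocol.

```python
from typing import Sequence


def evaluation_agreement(student: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of examples where the student's verdict equals the teacher's."""
    if len(student) != len(teacher):
        raise ValueError("verdict lists must have the same length")
    if not student:
        raise ValueError("need at least one verdict pair")
    matches = sum(s == t for s, t in zip(student, teacher))
    return matches / len(student)


# Toy verdicts, invented for illustration (not data from the paper).
teacher_verdicts = ["correct", "incorrect", "correct", "correct", "incorrect"]
student_verdicts = ["correct", "incorrect", "correct", "incorrect", "incorrect"]

agreement = evaluation_agreement(student_verdicts, teacher_verdicts)
print(f"agreement = {agreement:.0%}")
```

Exact string matching is the simplest possible comparison; free-text self-evaluations would first need to be mapped to a verdict label, which this sketch leaves out.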

Results

Accuracy (SVAMP, T5-Base)

Value: 60.3% ±0.6

Baseline (1 CoT): 51.7% ±2.1

Accuracy

Value: 65.0% ±0.1

Baseline (1 CoT): 63.4% ±0.2

Accuracy

Value: 44.3% ±0.2

Baseline (1 CoT): 39.8% ±0.4

Performance gain vs 1 CoT (by model scale)

Value: T5-Base +8.6 pts; T5-Small +10.1 pts; T5-Large +3.1 pts

Baseline: 1 CoT

Self-evaluation agreement

Value: ≈90% agreement with GPT-3.5

Baseline: SLMs without self-eval have near-zero evaluation ability
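
To make the size of these improvements easier to compare, the short sketch below recomputes absolute and relative gains from the three accuracy rows reported above, taken in the order they appear. The helper name `gains` is an illustration only; the input numbers are the ones listed in this section.

```python
# (accuracy with 5 CoTs + self-evaluation, 1-CoT baseline accuracy), in percent,
# copied from the three Accuracy rows above in the order they appear.
rows = [(60.3, 51.7), (65.0, 63.4), (44.3, 39.8)]


def gains(value: float, baseline: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(value - baseline, 1)
    relative = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative


results = [gains(value, baseline) for value, baseline in rows]
for (value, baseline), (absolute, relative) in zip(rows, results):
    print(f"{baseline}% -> {value}%: +{absolute} pts ({relative}% relative)")
```

The first row reproduces the +8.6 pts reported for T5-Base, which is a quick consistency check on the figures.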

Who Should Care

What To Try In 7 Days

Generate 5 CoTs and 5 self-evaluations per example from a teacher LLM (e.g., GPT‑3.5).

Train a small T5 model with a multitask prefix setup: 'predict:' for labels and 'explain:' for rationales.

Use λ≈0.5 to weight label vs rationale loss (start there, then tune).
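
The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TeacherOutput` record, the helper names, and the toy question are hypothetical, and the convex combination `lam * label_loss + (1 - lam) * rationale_loss` is one plausible reading of "weight label vs rationale loss" that the summary does not confirm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TeacherOutput:
    """One teacher sample for a question (hypothetical record layout)."""
    question: str
    rationale: str  # a chain of thought or a self-evaluation text from the teacher
    label: str      # the final answer attached to that rationale


def build_multitask_pairs(outputs: List[TeacherOutput]) -> List[Tuple[str, str]]:
    """Turn each teacher sample into two (input, target) pairs, one per task prefix."""
    pairs = []
    for out in outputs:
        pairs.append((f"predict: {out.question}", out.label))
        pairs.append((f"explain: {out.question}", out.rationale))
    return pairs


def weighted_loss(label_loss: float, rationale_loss: float, lam: float = 0.5) -> float:
    """Assumed combination: lam weights the label loss, (1 - lam) the rationale loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * label_loss + (1.0 - lam) * rationale_loss


# Toy data, invented for illustration: two different rationales for one question.
samples = [
    TeacherOutput("Tom has 3 apples and buys 2 more. How many now?", "3 + 2 = 5.", "5"),
    TeacherOutput("Tom has 3 apples and buys 2 more. How many now?",
                  "Start at 3, count up 2: 4, 5.", "5"),
]
pairs = build_multitask_pairs(samples)
total = weighted_loss(label_loss=0.8, rationale_loss=1.2)
print(len(pairs), total)
```

In a real training loop the two loss values would come from the student model's forward passes on the `predict:` and `explain:` batches; the floats here only stand in for those values.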

Optimization Features

Model Optimization

  • distillation

Training Optimization

  • multi-task learning
  • pseudo-labeling
  • few-shot prompting

Reproducibility

Data Urls

  • SVAMP
  • CommonsenseQA
  • ANLI

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use only GPT‑3.5 as teacher; results may vary with other LLMs.
  • Benchmarks limited to three tasks; generality to other task types unproven.
  • Self-evaluation quality is bounded by the teacher’s own introspection and may propagate teacher biases.

When Not To Use

  • When you cannot query a reliable teacher LLM for multiple CoTs and self-evaluations.
  • When training cost or labeling budget forbids generating many teacher outputs.
  • For tasks where reasoning diversity adds noise rather than signal (monitor validation).

Failure Modes

  • Teacher self-evaluations are incorrect or biased, causing SLM to learn wrong checks.
  • Too many CoTs introduce conflicting reasoning and harm student learning (degradation observed beyond 7 CoTs).
  • Low-quality pseudo-labels reduce benefit; human labels still outperform pseudo-labels.
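
One way to guard against the "too many CoTs" failure mode is to treat the number of CoTs as a hyperparameter chosen on validation data. The sketch below picks the smallest count whose validation accuracy is within a tolerance of the best. The function `pick_num_cots`, the tolerance, and the accuracy values are assumptions for illustration; they are not results from the paper.

```python
from typing import Dict


def pick_num_cots(val_accuracy: Dict[int, float], tolerance: float = 0.2) -> int:
    """Smallest number of CoTs whose validation accuracy is within
    `tolerance` points of the best observed accuracy."""
    if not val_accuracy:
        raise ValueError("need at least one (num_cots, accuracy) entry")
    best = max(val_accuracy.values())
    return min(k for k, acc in val_accuracy.items() if acc >= best - tolerance)


# Invented validation accuracies in percent, keyed by number of CoTs per example.
toy_validation = {1: 51.0, 3: 57.5, 5: 60.0, 7: 60.1, 9: 58.4}
chosen = pick_num_cots(toy_validation)
print(f"chosen number of CoTs: {chosen}")
```

Preferring the smallest count within tolerance also keeps teacher-query cost down, since each extra CoT means more teacher outputs to generate.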

Core Entities

Models

  • gpt-3.5-turbo
  • T5-Base (220M)
  • T5-Small (60M)
  • T5-Large (770M)

Metrics

  • Accuracy

Datasets

  • SVAMP (math word problems)
  • CommonsenseQA (CQA)
  • ANLI (natural language inference)