Teach small models to judge their own chain-of-thoughts and learn from multiple reasoning paths

November 15, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: reproducible code, clear recipe (5 CoTs + multiple self-evals, multitask loss), and measurable gains on three public datasets, but teacher model variety and real-world deployment tests are limited.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Weize Liu, Guocong Li, Kai Zhang, Bang Du, Qiyuan Chen, Xuming Hu, Hongxia Xu, Jintai Chen, Jian Wu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can compress LLM reasoning into smaller models while teaching them to flag bad reasoning. That lowers inference cost and raises reliability for deployed small models in edge or cost-sensitive settings.

Who Should Care

Summary TLDR

This paper shows a practical way to distill two things from a large language model (LLM) into a much smaller model (SLM): (1) the LLM's ability to self-evaluate its chain-of-thought (CoT) reasoning, and (2) multiple diverse CoTs so the SLM learns broader reasoning. Using GPT‑3.5 as teacher and T5 variants as students, the method (5 CoTs + multiple self-evaluations per CoT, multitask training) raises accuracy across math word problems, commonsense QA, and NLI, and reduces some hallucinations compared to vanilla CoT distillation.

Problem Statement

CoT distillation transfers LLM reasoning into small models but also transfers flawed or hallucinated reasoning. Single CoTs miss alternative valid reasoning paths. The paper aims to teach small models both diverse CoTs and the teacher's self-evaluation ability so students learn when reasoning is correct and avoid inheriting errors.

Main Contribution

Distill an LLM's self-evaluation outputs so an SLM learns to judge its own CoTs and correct likely errors.

Distill multiple diverse CoTs plus corresponding self-evaluations to expose SLMs to a broader reasoning space.

Key Findings

Combining five CoTs with self-evaluation improves T5-Base accuracy on SVAMP vs 1-CoT distillation

Numbers: 60.3% vs 51.7% (±0.6 vs ±2.1)

Practical Use: Use multiple CoTs plus self-eval when distilling math reasoning: expect ≈8.6 percentage points higher accuracy on this benchmark.

Evidence Ref: Table 1 (T5-Base, SVAMP, pseudo-labels)

5 CoTs w/ self-evaluation yields consistent gains across tasks and label types

Numbers: Improvements shown vs baselines on SVAMP/CQA/ANLI (Table 1)

Practical Use: Apply this distillation pattern broadly: benefits appear on math, commonsense QA, and NLI in experiments.

Evidence Ref: Table 1 (all datasets)

Results

Metric | Value | Baseline | Delta (percentage points) | Split / Dataset | Evidence | Evidence Ref
Accuracy | 60.3% ±0.6 | 1 CoT: 51.7% ±2.1 | +8.6 | SVAMP (pseudo-labels) | Table 1: 5 CoTs w/ Self-Evaluation vs 1 CoT | Table 1
Accuracy | 65.0% ±0.1 | 1 CoT: 63.4% ±0.2 | +1.6 | CQA (human labels) | Table 1: 5 CoTs w/ Self-Evaluation vs 1 CoT | Table 1
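
The deltas in the results above are absolute differences in accuracy (percentage points), not relative improvements. A minimal sketch of the arithmetic, using only the reported accuracies; the function names are illustrative and not taken from the paper's code:

```python
def delta_points(value_pct: float, baseline_pct: float) -> float:
    """Absolute accuracy difference, in percentage points."""
    return round(value_pct - baseline_pct, 1)


def relative_gain_pct(value_pct: float, baseline_pct: float) -> float:
    """Improvement relative to the baseline accuracy, in percent."""
    return round(100.0 * (value_pct - baseline_pct) / baseline_pct, 1)


# (5 CoTs w/ self-evaluation, 1 CoT baseline) accuracies as reported for T5-Base
rows = {
    "SVAMP (pseudo-labels)": (60.3, 51.7),
    "CQA (human labels)": (65.0, 63.4),
}

for name, (value, baseline) in rows.items():
    print(name, delta_points(value, baseline), relative_gain_pct(value, baseline))
```

On SVAMP the 8.6-point gain corresponds to roughly a 16.6% relative improvement over the 1-CoT baseline; on CQA the 1.6-point gain is about 2.5% relative.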

What To Try In 7 Days

Generate 5 CoTs and 5 self-evaluations per example from a teacher LLM (e.g., GPT‑3.5).

Train a small T5 model with a multitask prefix setup: 'predict:' for labels and 'explain:' for rationales.

Use λ≈0.5 to weight label vs rationale loss (start there, then tune).
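
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Seq2SeqExample` type and both function names are hypothetical, the choice of one 'predict:' record plus one 'explain:' record per CoT is a guess at the data layout, and the loss is assumed to be a convex combination λ·L_label + (1−λ)·L_rationale. The summary only says that λ weights label against rationale loss, so check the released code for the exact form. Self-evaluation records would presumably be built the same way, but their input format is not given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the student model, including the task prefix
    target: str  # text the student is trained to generate


def build_multitask_examples(
    question: str, label: str, rationales: List[str]
) -> List[Seq2SeqExample]:
    """Create one label-prediction record and one rationale record per CoT."""
    examples = [Seq2SeqExample(f"predict: {question}", label)]
    for rationale in rationales:
        examples.append(Seq2SeqExample(f"explain: {question}", rationale))
    return examples


def weighted_loss(label_loss: float, rationale_loss: float, lam: float = 0.5) -> float:
    """Assumed form: lam * label_loss + (1 - lam) * rationale_loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * label_loss + (1.0 - lam) * rationale_loss


# Toy usage with two made-up teacher rationales for one question.
examples = build_multitask_examples(
    "Tom has 2 apples and buys 3 more. How many apples does he have?",
    "5",
    ["He starts with 2 and adds 3, so 2 + 3 = 5.", "3 more than 2 is 5."],
)
for ex in examples:
    print(ex.source, "->", ex.target)
```

With λ = 0.5 both tasks contribute equally, which is why it is a reasonable starting point before tuning.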

Optimization Features

Model Optimization: distillation
Training Optimization: multi-task learning, pseudo-labeling, few-shot prompting

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

SVAMP, CommonsenseQA, ANLI

Risks & Boundaries

Limitations

Experiments use only GPT‑3.5 as teacher; results may vary with other LLMs.

Benchmarks limited to three tasks; generality to other task types unproven.

When Not To Use

When you cannot query a reliable teacher LLM for multiple CoTs and self-evaluations.

When training cost or labeling budget forbids generating many teacher outputs.
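
To judge whether the teacher budget is a blocker, it helps to estimate how many teacher generations the recipe needs. The sketch below assumes one generation per CoT and per self-evaluation, and that each CoT receives its own set of self-evaluations; the function name, the default counts, and the 1,000-example training set are illustrative assumptions, not figures from the paper.

```python
def teacher_generations(n_examples: int, n_cots: int = 5, evals_per_cot: int = 5) -> int:
    """Count teacher outputs: every CoT plus the self-evaluations of that CoT."""
    if n_examples < 0 or n_cots < 0 or evals_per_cot < 0:
        raise ValueError("counts must be non-negative")
    return n_examples * n_cots * (1 + evals_per_cot)


# Hypothetical 1,000-example training set.
full_recipe = teacher_generations(1000)                        # 5 CoTs, 5 self-evals each
single_cot = teacher_generations(1000, n_cots=1, evals_per_cot=0)
print(full_recipe, single_cot, full_recipe // single_cot)
```

Under these assumptions the full recipe needs 30 times as many teacher outputs as plain 1-CoT distillation, so the accuracy gain has to justify that one-off data-generation cost.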

Failure Modes

Teacher self-evaluations are incorrect or biased, causing the SLM to learn wrong checks.

Too many CoTs introduce conflicting reasoning and harm student learning (observed with more than 7 CoTs).

Core Entities

Models

gpt-3.5-turbo, T5-Base (220M), T5-Small (60M), T5-Large (770M)

Metrics

Accuracy

Datasets

SVAMP (math word problems), CommonsenseQA (CQA), ANLI (natural language inference)