Teach small models to judge their own chain-of-thoughts and learn from multiple reasoning paths

Overview

Decision Snapshot: Needs Validation

The method is practical: it offers reproducible code, a clear recipe (5 CoTs + multiple self-evals, multitask loss), and measurable gains on three public datasets. However, it was tested with only one teacher model, and real-world deployment tests are limited.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Weize Liu, Guocong Li, Kai Zhang, Bang Du, Qiyuan Chen, Xuming Hu, Hongxia Xu, Jintai Chen, Jian Wu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can compress LLM reasoning into smaller models while teaching them to flag bad reasoning. That lowers inference cost and raises reliability for deployed small models in edge or cost-sensitive settings.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager, CTO

Summary TLDR

This paper shows a practical way to distill two things from a large language model (LLM) into a much smaller model (SLM): (1) the LLM's ability to self-evaluate its chain-of-thought (CoT) reasoning, and (2) multiple diverse CoTs so the SLM learns broader reasoning. Using GPT‑3.5 as teacher and T5 variants as students, the method (5 CoTs + multiple self-evaluations per CoT, multitask training) raises accuracy across math word problems, commonsense QA, and NLI, and reduces some hallucinations compared to vanilla CoT distillation.

Problem Statement

CoT distillation transfers LLM reasoning into small models but also transfers flawed or hallucinated reasoning. Single CoTs miss alternative valid reasoning paths. The paper aims to teach small models both diverse CoTs and the teacher's self-evaluation ability so students learn when reasoning is correct and avoid inheriting errors.

Main Contribution

Distill an LLM's self-evaluation outputs so an SLM learns to judge its own CoTs and correct likely errors.

Distill multiple diverse CoTs plus corresponding self-evaluations to expose SLMs to a broader reasoning space.
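The two contributions above can be pictured as a data-preparation step. The sketch below is only an illustration: the class names (`TeacherCoT`, `DistillRecord`), the function `self_eval_pairs`, and the "Question: / Reasoning:" input template are hypothetical and are not taken from the paper or its code. It shows one way teacher outputs might be stored and flattened into (input, target) pairs for training the student to reproduce the teacher's self-evaluations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TeacherCoT:
    """One teacher chain-of-thought plus the teacher's evaluations of it."""
    rationale: str
    self_evaluations: List[str] = field(default_factory=list)


@dataclass
class DistillRecord:
    """All teacher outputs collected for a single training question."""
    question: str
    cots: List[TeacherCoT]


def self_eval_pairs(record: DistillRecord) -> List[Tuple[str, str]]:
    """Flatten a record into (source, target) pairs.

    The source holds the question and one CoT; the target is one
    teacher self-evaluation of that CoT. The text template is an
    assumption made for this sketch.
    """
    pairs = []
    for cot in record.cots:
        source = f"Question: {record.question}\nReasoning: {cot.rationale}"
        for evaluation in cot.self_evaluations:
            pairs.append((source, evaluation))
    return pairs
```

Each CoT contributes one pair per self-evaluation, so a question with several CoTs and several evaluations per CoT yields many training pairs from a single labeled example.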

Key Findings

Combining five CoTs with self-evaluation improves T5-Base accuracy on SVAMP vs 1-CoT distillation

Numbers: 60.3% vs 51.7% (±0.6 vs ±2.1)

Practical Use: Use multiple CoTs plus self-eval when distilling math reasoning: expect ≈8.6 percentage points higher accuracy on this benchmark.

Evidence Ref: Table 1 (T5-Base, SVAMP, pseudo-labels)

5 CoTs w/ self-evaluation yields consistent gains across tasks and label types

Numbers: Improvements shown vs baselines on SVAMP/CQA/ANLI (Table 1)

Practical Use: Apply this distillation pattern broadly: benefits appear on math, commonsense QA, and NLI in experiments.

Evidence Ref: Table 1 (all datasets)

Results

Metric	Value	Baseline	Delta (percentage points)	Split / Dataset	Evidence	Evidence Ref
Accuracy	60.3% ±0.6	1 CoT: 51.7% ±2.1	+8.6 pp	SVAMP (pseudo-labels)	Table 1: 5 CoTs w/ Self-Evaluation vs 1 CoT	Table 1
Accuracy	65.0% ±0.1	1 CoT: 63.4% ±0.2	+1.6 pp	CQA (human labels)	Table 1: 5 CoTs w/ Self-Evaluation vs 1 CoT	Table 1
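The deltas in the table are absolute differences in accuracy, that is, percentage points rather than relative percentages. A short sketch makes the distinction explicit. The accuracy figures are copied from the table above; the helper names `delta_pp` and `relative_gain` are made up for this example.

```python
# Accuracy values (percent) copied from the Results table.
results = {
    "SVAMP (pseudo-labels)": {"five_cots_self_eval": 60.3, "one_cot": 51.7},
    "CQA (human labels)": {"five_cots_self_eval": 65.0, "one_cot": 63.4},
}


def delta_pp(new: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new - baseline, 1)


def relative_gain(new: float, baseline: float) -> float:
    """Gain relative to the baseline, in percent, rounded to one decimal."""
    return round(100.0 * (new - baseline) / baseline, 1)


deltas = {
    name: delta_pp(row["five_cots_self_eval"], row["one_cot"])
    for name, row in results.items()
}
relative = {
    name: relative_gain(row["five_cots_self_eval"], row["one_cot"])
    for name, row in results.items()
}

for name in results:
    print(f"{name}: +{deltas[name]} pp absolute, +{relative[name]}% relative")
```

The relative figure is always larger than the percentage-point figure here because both baselines are below 100%, so it matters which one is quoted when comparing against other reports.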

What To Try In 7 Days

Generate 5 CoTs and 5 self-evaluations per example from a teacher LLM (e.g., GPT‑3.5).

Train a small T5 model with a multitask prefix setup: 'predict:' for labels and 'explain:' for rationales.

Use λ≈0.5 to weight label vs rationale loss (start there, then tune).
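The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Seq2SeqExample`, `build_multitask_examples`, and `combined_loss` are hypothetical names, and the convex combination λ·label + (1 − λ)·rationale is one plausible reading of "weight label vs rationale loss" that should be checked against the released code. The two loss values would come from the student model's forward passes on the 'predict:' and 'explain:' batches.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seq2SeqExample:
    source: str  # model input, including the task prefix
    target: str  # text the student should generate
    task: str    # "predict" or "explain"


def build_multitask_examples(
    question: str, label: str, rationales: List[str]
) -> List[Seq2SeqExample]:
    """Create one 'predict:' example and one 'explain:' example per rationale."""
    examples = [Seq2SeqExample(f"predict: {question}", label, "predict")]
    for rationale in rationales:
        examples.append(
            Seq2SeqExample(f"explain: {question}", rationale, "explain")
        )
    return examples


def combined_loss(label_loss: float, rationale_loss: float, lam: float = 0.5) -> float:
    """Weighted multitask loss (assumed form): lam * label + (1 - lam) * rationale."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * label_loss + (1.0 - lam) * rationale_loss
```

With `lam=0.5` both tasks contribute equally. Raising it pushes the student toward label accuracy, while lowering it emphasizes reproducing the teacher's rationales.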

Optimization Features

Model Optimization

distillation

Training Optimization

multi-task learning, pseudo-labeling, few-shot prompting

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Attention-is-All-I-Need/Mind-s-Mirror-Distilling-LLM

Data URLs

SVAMP, CommonsenseQA, ANLI

Risks & Boundaries

Limitations

Experiments use only GPT‑3.5 as teacher; results may vary with other LLMs.

Benchmarks limited to three tasks; generality to other task types unproven.

When Not To Use

When you cannot query a reliable teacher LLM for multiple CoTs and self-evaluations.

When training cost or labeling budget forbids generating many teacher outputs.
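To judge whether the teacher budget is a blocker, a rough call count helps. The sketch below assumes one teacher request per CoT and one per self-evaluation. The function name `teacher_calls` and the example counts are illustrative only, and real costs also depend on token lengths and any batching, which this estimate ignores.

```python
def teacher_calls(num_examples: int, cots_per_example: int, evals_per_cot: int) -> int:
    """Estimate teacher LLM requests needed to build the distillation set.

    Assumes one request per CoT and one request per self-evaluation.
    """
    if min(num_examples, cots_per_example, evals_per_cot) < 0:
        raise ValueError("counts must be non-negative")
    cot_calls = num_examples * cots_per_example
    eval_calls = cot_calls * evals_per_cot
    return cot_calls + eval_calls


# Illustrative numbers: 1,000 training examples, 5 CoTs each,
# and a hypothetical 5 self-evaluations per CoT.
total = teacher_calls(1000, cots_per_example=5, evals_per_cot=5)
print(total)
```

Because self-evaluations multiply with the number of CoTs, the evaluation requests dominate the total, so reducing evaluations per CoT is the cheapest lever if the budget is tight.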

Failure Modes

Teacher self-evaluations are incorrect or biased, causing the SLM to learn wrong checks.

Too many CoTs introduce conflicting reasoning and harm student learning (observed with more than 7 CoTs).

Core Entities

Models

gpt-3.5-turbo, T5-Base (220M), T5-Small (60M), T5-Large (770M)

Metrics

Accuracy

Datasets

SVAMP (math word problems), CommonsenseQA (CQA), ANLI (natural language inference)
