Teach small models by staging structured debates with stronger models and distilling the debate trees

June 4, 2025 · 7 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Xiaofeng Zhou, Heyan Huang, Lizi Liao

Why It Matters For Business

D&R compresses reasoning skills from expensive LLMs into smaller, cheaper models and cuts per-query token cost. That lowers deployment cost and speeds up inference without manual human feedback loops.

Summary TLDR

The paper presents D&R: a pipeline that runs multi-turn debates between a small student model and stronger teacher models, records the debate as a Multi-Agent Interaction Graph (MAG), converts interactions into hierarchical preference trees, and trains the student by supervised fine-tuning (SFT) followed by Tree-structured Direct Preference Optimization (T-DPO). On MMLU-Pro and MATH, D&R raised a 7B student model's average accuracy from 23.98 to 38.16 and reduced per-task token cost, while ablations show self-reflection and teacher feedback in debate data are crucial.
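
To make the Multi-Agent Interaction Graph more concrete, here is a minimal sketch of how a debate log could be stored as a graph. The class names, field names, and the three node kinds are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DebateNode:
    """One utterance in a debate (hypothetical schema, not the paper's)."""
    node_id: int
    agent: str        # e.g. "student" or "teacher_a"
    round_idx: int    # debate round in which the utterance was produced
    kind: str         # "response", "self_reflection", or "feedback"
    text: str
    is_correct: Optional[bool] = None  # only meaningful for responses


@dataclass
class InteractionGraph:
    """Directed graph: an edge u -> v means v was written in reply to u."""
    nodes: Dict[int, DebateNode] = field(default_factory=dict)
    edges: Dict[int, List[int]] = field(default_factory=dict)

    def add(self, node: DebateNode, reply_to: Optional[int] = None) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])
        if reply_to is not None:
            self.edges.setdefault(reply_to, []).append(node.node_id)

    def replies(self, node_id: int) -> List[DebateNode]:
        return [self.nodes[i] for i in self.edges.get(node_id, [])]


# A toy two-round exchange: wrong answer, teacher feedback, reflection, fix.
mag = InteractionGraph()
mag.add(DebateNode(0, "student", 1, "response", "Answer: B", is_correct=False))
mag.add(DebateNode(1, "teacher_a", 1, "feedback", "Check the units."), reply_to=0)
mag.add(DebateNode(2, "student", 2, "self_reflection", "I dropped a factor."), reply_to=1)
mag.add(DebateNode(3, "student", 2, "response", "Answer: C", is_correct=True), reply_to=2)
```

Keeping responses, reflections, and feedback as typed nodes makes it straightforward to later select which interactions become chosen or rejected branches of a preference tree.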

Problem Statement

Large language models reason well but are expensive to run. Existing distillation and feedback methods either lack iterative, targeted teacher guidance or are too costly to scale. The question is how to transfer deep reasoning and self-correction behaviors from strong models into smaller ones efficiently, so that the gains persist and inference cost drops.

Main Contribution

A Debate & Reflect (D&R) pipeline in which a student debates with multiple teacher models, and the resulting responses, self-reflections, and teacher feedback are collected as training data.

Tree-structured Direct Preference Optimization (T-DPO), which turns debate logs into hierarchical preference trees for preference-based fine-tuning.

Empirical demonstration that SFT + T-DPO on debate-derived data improves small models' accuracy and token-efficiency versus single-teacher and prior multi-teacher baselines.
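
For orientation, the standard DPO objective that preference-based fine-tuning of this kind builds on is shown below, where $\pi_\theta$ is the student policy, $\pi_{\mathrm{ref}}$ a frozen reference model (typically the SFT checkpoint), $y_w$ the chosen response, $y_l$ the rejected one, and $\beta$ a temperature. The second line is only an assumed reading of "tree-structured": the same loss summed over the (context, chosen, rejected) triples $\mathcal{P}(T)$ extracted from a preference tree $T$. The paper's exact T-DPO formulation may differ.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

% Assumed tree-level aggregation (illustrative, not taken from the paper):
\mathcal{L}_{\mathrm{tree}}(\theta; T)
  = -\sum_{(x,\,y_w,\,y_l) \in \mathcal{P}(T)}
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
```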

Key Findings

D&R raised the average accuracy of Mistral-7B-Instruct from 23.98 to 38.16 on evaluated benchmarks.

Numbers: avg +14.18 pts (23.98 -> 38.16)

D&R outperformed the best single-teacher distillation baseline by 2.95 points on average.

Numbers: avg +2.95 pts (35.21 -> 38.16)

Distillation with D&R reduced inference token cost per problem versus the original model.

Numbers: avg tokens reduced 627.10 -> 528.83 (−98.27 tokens)

Removing self-reflection or teacher feedback from debate data hurts final accuracy.

Numbers: ablation loss up to −5.54 pts

Student capacity limits gains on hard reasoning tasks like MATH; a larger student learned more.

Numbers: Mistral MATH after D&R 17.32 vs Llama-3.1-8B after D&R 48.02
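
The headline deltas can be re-derived from the reported figures. The short check below uses only numbers quoted in this summary; the relative percentages are derived here and are not reported in the source.

```python
# Figures as reported in the findings above.
baseline_acc = 23.98        # Mistral-7B-Instruct, no distillation (average)
single_teacher_acc = 35.21  # best single-teacher distillation (average)
dr_acc = 38.16              # after D&R (average)
baseline_tokens = 627.10    # average tokens per problem before distillation
dr_tokens = 528.83          # average tokens per problem after D&R

gain_vs_baseline = round(dr_acc - baseline_acc, 2)
gain_vs_single = round(dr_acc - single_teacher_acc, 2)
tokens_saved = round(baseline_tokens - dr_tokens, 2)

# Derived relative changes (not stated in the source).
relative_acc_gain = round(100 * gain_vs_baseline / baseline_acc, 1)
relative_token_cut = round(100 * tokens_saved / baseline_tokens, 1)

print(f"accuracy gain vs no distillation: +{gain_vs_baseline} pts ({relative_acc_gain}% relative)")
print(f"accuracy gain vs best single teacher: +{gain_vs_single} pts")
print(f"tokens saved per problem: {tokens_saved} ({relative_token_cut}% fewer)")
```

In relative terms this works out to roughly a 59% accuracy improvement over the undistilled student and about 16% fewer tokens per problem.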

Results

Accuracy (average, Mistral-7B-Instruct)

Value: 38.16

Baseline: 23.98 (no distillation)

Accuracy (MATH, Mistral-7B-Instruct)

Value: 17.32

Baseline: 8.02 (no distillation)

Best single-teacher distillation average

Value: 35.21

Baseline: 23.98 (Mistral, no distillation)

Per-problem token cost (avg)

Value: 528.83 tokens

Baseline: 627.10 tokens (Mistral baseline)

Who Should Care

What To Try In 7 Days

Generate debates between your strong model(s) and a target small model on 100–300 representative tasks.

Record MAGs, extract root->chosen/rejected response pairs, and build simple preference trees.

Apply SFT on correct answers, then run a preference-based optimization (DPO/T-DPO) with LoRA on the student model and measure accuracy + token cost.
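
The pair-extraction step above can be sketched as a small recursive walk over a debate tree. This is a simplified illustration under stated assumptions: the `TreeNode` type, the `preference_pairs` helper, and the rule "correct siblings are chosen, incorrect siblings are rejected" are hypothetical and may not match how the paper builds its trees.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    """A node in a toy preference tree (hypothetical structure)."""
    text: str
    is_correct: bool
    children: List["TreeNode"] = field(default_factory=list)


def preference_pairs(context: str, node: TreeNode) -> List[Tuple[str, str, str]]:
    """Collect (context, chosen, rejected) triples from sibling responses."""
    pairs = []
    chosen = [c for c in node.children if c.is_correct]
    rejected = [c for c in node.children if not c.is_correct]
    for good in chosen:
        for bad in rejected:
            pairs.append((context, good.text, bad.text))
    for child in node.children:
        # Deeper levels see the earlier turn as part of their context.
        pairs.extend(preference_pairs(context + "\n" + child.text, child))
    return pairs


# Root holds the question; its children are first-round responses.
root = TreeNode("What is 2 + 2?", True, [
    TreeNode("Answer: 5", False, [
        TreeNode("Reflection: I miscounted. Answer: 4", True),
        TreeNode("Answer is still 5", False),
    ]),
    TreeNode("Answer: 4", True),
])

pairs = preference_pairs(root.text, root)
for ctx, good, bad in pairs:
    print(repr(ctx), "| chosen:", good, "| rejected:", bad)
```

Triples in this (prompt, chosen, rejected) shape are what off-the-shelf DPO training loops generally expect, so a pilot can start with plain DPO before attempting anything tree-specific.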

Agent Features

Memory

  • Multi-Agent Interaction Graph (MAG) records short-term debate history

Planning

  • multi-turn debate rounds
  • iterative correction via self-reflection

Frameworks

  • D&R (Debate & Reflect)
  • T-DPO
  • MAG

Is Agentic

true

Collaboration

  • multi-teacher debate and student participation

Optimization Features

Token Efficiency

  • measured token reduction (≈98 tokens avg per problem)

Model Optimization

  • preference-based fine-tuning (T-DPO)
  • SFT
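
As a rough illustration of what preference-based fine-tuning optimizes, the sketch below computes the standard DPO loss for a single (chosen, rejected) pair from sequence log-probabilities. The function name `dpo_pair_loss`, the default `beta`, and the example log-probabilities are assumptions for illustration; this is plain DPO, not the paper's T-DPO.

```python
import math


def dpo_pair_loss(logp_chosen: float, logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy (logp_*) and the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); use a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference signal yet.
neutral = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response.
improved = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)

print(f"neutral loss:  {neutral:.4f}")
print(f"improved loss: {improved:.4f}")
```

The loss starts at log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the chosen response than the reference does.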

Training Optimization

  • constructing preference trees from MAGs
  • LoRA

Inference Optimization

  • distillation reduces token cost per problem
  • learned self-correction lets the student answer in one pass, without a multi-call debate at inference

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations focus on final-answer correctness, not full verification of intermediate reasoning steps.
  • Study targets knowledge and reasoning tasks only; other task families are not evaluated.

When Not To Use

  • If you require verified step-by-step process proofs rather than final-answer accuracy.
  • When the target student model is too small to represent complex reasoning (e.g., Mistral struggled on MATH).

Failure Modes

  • Student may learn to match final answers without valid reasoning, producing plausible but unsound chain-of-thought.
  • Quality depends on teacher models; poor teacher feedback will propagate errors.
  • Preference objectives may be task-dependent (RPO/T-DPO behaved differently across categories).

Core Entities

Models

  • Mistral-7B-Instruct
  • Llama-3.1-8B-Instruct
  • gpt-4o
  • claude-3.5
  • gemini-1.5-pro

Metrics

  • Accuracy
  • token cost (tokens per problem)

Datasets

  • MMLU-Pro (computer science, physics, biology)
  • MATH

Benchmarks

  • MMLU-Pro
  • MATH

Context Entities

Models

  • GPT-4o
  • Claude 3.5
  • Gemini 1.5 Pro

Datasets

  • MMLU (original)
  • Other standard reasoning benchmarks (referenced)