Teach a single reasoning model to switch between fast (answer-only) and slow (chain-of-thought) modes to save tokens without losing accuracy

June 3, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.62

Cost Impact Score

0.7

Citation Count

1

Authors

Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Zhe Li, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang

Why It Matters For Business

OThink-R1 cuts costly reasoning tokens at inference while keeping accuracy, lowering latency and per-request compute cost for products that use step-by-step reasoning.

Summary TLDR

OThink-R1 trains an existing large reasoning model (LRM) to choose between fast and slow thinking on its own. The method (1) extracts patterns that mark essential vs redundant chain-of-thought (CoT) steps, (2) uses an LLM judge (GPT-4o) to label trajectories, (3) builds a hybrid SFT dataset with pruned (fast) and full (slow) examples, and (4) fine-tunes with dual KL constraints to avoid mode collapse. On four benchmarks (OpenBookQA, CommonsenseQA, ASDIV, GSM8K) it cuts reasoning tokens (for example, 421 vs 522 on OpenBookQA with the 14B model) while keeping or improving accuracy versus the base LRM.
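Steps (2) and (3) can be pictured with a minimal sketch. It is not the authors' code: the `Trajectory` type, the `build_hybrid_sft` helper, and the `toy_judge` stand-in for GPT-4o are hypothetical names, and the `<think>...</think>` target layout with an empty block for fast examples is an assumption about how pruned examples might be serialized.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    question: str
    reasoning: str  # chain-of-thought text produced by the LRM
    answer: str     # final answer produced by the LRM


def build_hybrid_sft(
    trajectories: List[Trajectory],
    is_redundant: Callable[[Trajectory], bool],
) -> List[Dict[str, str]]:
    """Emit a pruned (fast) target when the judge flags the CoT as
    redundant, and the full (slow) target otherwise."""
    records = []
    for t in trajectories:
        if is_redundant(t):
            # Assumed format: an empty think block marks a fast example.
            target = f"<think>\n</think>\n{t.answer}"
            mode = "fast"
        else:
            target = f"<think>\n{t.reasoning}\n</think>\n{t.answer}"
            mode = "slow"
        records.append({"prompt": t.question, "target": target, "mode": mode})
    return records


def toy_judge(t: Trajectory) -> bool:
    # Hypothetical stand-in for the LLM judge: treats an explicit
    # re-verification of an already finished answer as redundant.
    return "double-check" in t.reasoning.lower()


examples = [
    Trajectory("What is 2 + 3?", "2 + 3 = 5. Let me double-check: 3 + 2 = 5.", "5"),
    Trajectory("What is 12 * 3 - 4?", "12 * 3 = 36, then 36 - 4 = 32.", "32"),
]
hybrid = build_hybrid_sft(examples, toy_judge)
```

In practice `is_redundant` would wrap a call to the judge model with the extracted patterns in its prompt; keeping it as a plain callable makes it easy to swap judges or audit their labels.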

Problem Statement

Large reasoning models always generate long step-by-step chains, which improve accuracy on hard problems but waste tokens and add latency on easy ones. The paper asks: how can a single model decide when to skip detailed reasoning and when to reason fully, saving inference cost without hurting accuracy?

Main Contribution

A hybrid training pipeline that equips one reasoning model to switch between fast (answer-only) and slow (CoT) modes automatically.

A small set of human-derived patterns that distinguish essential from redundant reasoning and an LLM-based judge (GPT-4o) to label trajectories at scale.

A dual-KL fine-tuning objective that preserves both reasoning ability and efficient generation and prevents mode collapse.
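To make the dual-KL idea concrete, here is a toy per-token sketch over a three-token vocabulary. It is an illustration under stated assumptions, not the paper's exact loss: the KL direction (policy relative to reference), the weights `beta_slow` and `beta_fast`, and the choice of a non-reasoning LLM as the second reference are all assumptions, and `dual_kl_loss` is a hypothetical helper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def dual_kl_loss(
    policy: Sequence[float],
    target_index: int,
    ref_slow: Sequence[float],  # next-token distribution of the reference LRM
    ref_fast: Sequence[float],  # assumed second reference (non-reasoning LLM)
    beta_slow: float = 0.1,     # hypothetical weight
    beta_fast: float = 0.1,     # hypothetical weight
) -> float:
    """SFT negative log-likelihood plus two KL penalties that anchor the
    fine-tuned policy to both reference distributions."""
    nll = -math.log(policy[target_index] + 1e-12)
    return (
        nll
        + beta_slow * kl_divergence(policy, ref_slow)
        + beta_fast * kl_divergence(policy, ref_fast)
    )


policy = [0.7, 0.2, 0.1]
anchored = dual_kl_loss(policy, 0, ref_slow=policy, ref_fast=policy)
drifted = dual_kl_loss(policy, 0, ref_slow=[0.1, 0.2, 0.7], ref_fast=policy)
```

When the policy matches both references the penalties vanish and only the SFT term remains; any drift from either reference raises the loss, which is the anchoring effect the objective is meant to provide.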

Key Findings

LRMs produce many more tokens than non-reasoning LLMs on common QA/math tasks.

Numbers: LRMs generate on average 7.32× more tokens than non-reasoning LLMs (Table 1).

OThink-R1 reduces reasoning tokens while maintaining or improving accuracy on evaluated benchmarks.

Numbers: on OpenBookQA (14B), 421 tokens vs 522 (base); accuracy 93.40% vs 92.80% (Table 2).

A large fraction of training CoT steps were judged redundant and pruned during dataset construction.

Numbers: the OpenBookQA prune ratio (training) was 67.36% for 14B (Table 3).

Dual KL constraints are critical to avoid overthinking or loss of performance.

Numbers: removing the reference-LRM KL term (14B, OpenBookQA) increased tokens from 421 to 10,599 (Table 4).

Results

OpenBookQA (14B)

Value: OThink-R1 tokens 421, ACC 93.40%

Baseline: DeepSeek-R1 tokens 522, ACC 92.80%

CommonsenseQA (14B)

Value: OThink-R1 tokens 435, ACC 81.80%

Baseline: DeepSeek-R1 tokens 569, ACC 81.70%

GSM8K (7B)

Value: OThink-R1 tokens 488, ACC 86.70%

Baseline: DeepSeek-R1 tokens 719, ACC 86.10%

Prune ratio (training) / Fast-thinking ratio (test)

Value: OpenBookQA (14B): prune 67.36% / fast-thinking 36.80%
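The relative savings implied by these figures can be worked out directly. The short script below uses only the token counts and accuracies listed above; the `token_reduction_pct` helper and the dictionary layout are illustrative names, not part of the paper.

```python
# (tokens, accuracy %) pairs copied from the results listed above.
results = {
    "OpenBookQA (14B)": {"othink": (421, 93.40), "base": (522, 92.80)},
    "CommonsenseQA (14B)": {"othink": (435, 81.80), "base": (569, 81.70)},
    "GSM8K (7B)": {"othink": (488, 86.70), "base": (719, 86.10)},
}


def token_reduction_pct(base_tokens: float, new_tokens: float) -> float:
    """Relative token saving versus the baseline, in percent."""
    return 100.0 * (base_tokens - new_tokens) / base_tokens


summary = {}
for name, r in results.items():
    new_tokens, new_acc = r["othink"]
    base_tokens, base_acc = r["base"]
    summary[name] = {
        "token_reduction_pct": token_reduction_pct(base_tokens, new_tokens),
        "accuracy_delta_pts": new_acc - base_acc,
    }
    print(
        f"{name}: {summary[name]['token_reduction_pct']:.1f}% fewer tokens, "
        f"{summary[name]['accuracy_delta_pts']:+.2f} accuracy points"
    )
```

On these three configurations the saving works out to roughly 19%, 24%, and 32% of reasoning tokens, with accuracy changes between +0.1 and +0.6 points.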

Who Should Care

What To Try In 7 Days

Run LRM outputs through an LLM judge on a sample set to measure how often CoT steps are redundant.

Construct a small hybrid SFT set: keep full CoTs for clearly essential cases, prune redundant chains where immediate answers already match.

Fine-tune with a dual KL loss to maintain both reasoning and concise-answer behavior, then measure tokens and accuracy on your top use cases.
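For the measurement step, a small harness along these lines may be enough to start. It is a sketch under assumptions: outputs are taken to wrap reasoning in `<think>...</think>` tags with an empty block meaning fast mode, whitespace-separated words stand in for a real tokenizer count, and `is_fast` and `evaluate` are hypothetical helper names.

```python
import re
from statistics import mean
from typing import Dict, List

# Assumed output convention: reasoning sits inside <think>...</think>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def is_fast(output: str) -> bool:
    """True when the output has no reasoning block or an empty one."""
    match = THINK_RE.search(output)
    return match is None or match.group(1).strip() == ""


def evaluate(samples: List[Dict[str, str]]) -> Dict[str, float]:
    """Each sample needs 'output', 'prediction', and 'gold' keys."""
    return {
        # Whitespace split is a rough proxy; swap in your tokenizer.
        "mean_tokens": mean(len(s["output"].split()) for s in samples),
        "accuracy": mean(1.0 if s["prediction"] == s["gold"] else 0.0 for s in samples),
        "fast_ratio": mean(1.0 if is_fast(s["output"]) else 0.0 for s in samples),
    }


samples = [
    {"output": "<think>\n</think>\nParis", "prediction": "Paris", "gold": "Paris"},
    {
        "output": "<think>\n12 * 3 = 36, minus 4 is 32\n</think>\n32",
        "prediction": "32",
        "gold": "31",
    },
]
report = evaluate(samples)
```

Running the same harness on the base model and the fine-tuned model gives the token, accuracy, and fast-thinking-ratio comparison in one pass.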

Agent Features

Planning

  • intrinsic mode selection (fast vs slow)

Tool Use

  • LLM judge (GPT-4o) for labeling

Frameworks

  • SFT

Optimization Features

Token Efficiency

  • SFT
  • fast-thinking mode activated for 8–37% of test cases (varies by dataset/scale)

Model Optimization

  • fine-tune LRM to emit two generation styles

System Optimization

  • no extra model parameters required; behavior learned in single model

Training Optimization

  • SFT
  • dual KL-divergence regularizer to anchor distributions

Inference Optimization

  • dynamic mode selection to skip CoT for many examples

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on an external LLM judge (GPT-4o) to label CoT traces, adding cost and potential judge bias.
  • Pattern discovery depended on a small panel of senior researchers; patterns may miss edge cases.
  • Evaluations limited to four QA/math datasets; different domains may need new pattern definitions.

When Not To Use

  • When you cannot afford the judge LLM calls during dataset construction.
  • When strict determinism is required and any judge-driven pruning risks unpredictable behavior.
  • If your use case needs full CoT for auditing or legal traceability on every example.

Failure Modes

  • Judge mislabels essential reasoning as redundant, causing accuracy drops on harder examples.
  • Removing dual-KL constraints leads to mode collapse or runaway overthinking (large token blowup).
  • Patterns identified by experts may not generalize, reducing pruning safety on new datasets.

Core Entities

Models

  • DeepSeek-R1-Distill-Qwen-7B/14B
  • Qwen2.5-Instruct
  • GPT-4o (judge)

Metrics

  • tokens
  • Accuracy

Datasets

  • OpenBookQA
  • CommonsenseQA
  • ASDIV
  • GSM8K

Benchmarks

  • OpenBookQA
  • CommonsenseQA
  • ASDIV
  • GSM8K

Context Entities

Models

  • Qwen2.5 (non-reasoning baseline)
  • DeepSeek-R1 (reference LRM)