Teach a single reasoning model to switch between fast (answer-only) and slow (chain-of-thought) modes to save tokens without losing accuracy

June 3, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The approach is practical: it relies on available LLMs to label trajectories and standard SFT tooling. Results span four common benchmarks and include targeted ablations validating each component.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 62%

Authors

Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Zhe Li, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang

Why It Matters For Business

OThink-R1 cuts costly reasoning tokens at inference while keeping accuracy, lowering latency and per-request compute cost for products that use step-by-step reasoning.

Summary TLDR

OThink-R1 trains an existing large reasoning model (LRM) to autonomously choose fast or slow thinking. The method (1) extracts patterns that mark essential vs redundant chain-of-thought (CoT) steps, (2) uses an LLM judge (GPT-4o) to label trajectories, (3) builds a hybrid SFT dataset with pruned (fast) and full (slow) examples, and (4) fine-tunes with dual KL constraints to avoid mode collapse. On four benchmarks (OpenBookQA, CommonsenseQA, ASDIV, GSM8K) it substantially cuts reasoning tokens while keeping or improving accuracy versus the base LRM.
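Steps (2) and (3) can be sketched as a small dataset-building routine. This is a minimal illustration rather than the authors' code: the `Trajectory` fields, the `fast_answer` field (an answer from a non-reasoning model), and the `judge_says_redundant` callable (standing in for a GPT-4o call) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    question: str
    reasoning: str    # chain-of-thought produced by the reasoning model
    answer: str       # final answer produced by the reasoning model
    fast_answer: str  # hypothetical: answer from a non-reasoning model


def build_hybrid_sft(
    trajectories: List[Trajectory],
    judge_says_redundant: Callable[[Trajectory], bool],
) -> List[Dict[str, str]]:
    """Prune the reasoning when the immediate answer matches and the judge
    marks the chain as redundant; otherwise keep the full chain."""
    examples = []
    for t in trajectories:
        answers_match = t.fast_answer.strip() == t.answer.strip()
        # Only consult the judge when the immediate answer already matches.
        prune = answers_match and judge_says_redundant(t)
        examples.append({
            "prompt": t.question,
            "reasoning": "" if prune else t.reasoning,
            "answer": t.answer,
            "mode": "fast" if prune else "slow",
        })
    return examples


if __name__ == "__main__":
    demo = [
        Trajectory("What is 2 + 2?", "2 + 2 = 4. Checking again: 4.", "4", "4"),
        Trajectory("A harder problem", "Step 1 ... Step 2 ...", "17", "12"),
    ]
    # Stub judge that calls every chain redundant; a real judge would be an LLM call.
    for row in build_hybrid_sft(demo, lambda t: True):
        print(row["mode"], repr(row["reasoning"]))
```

Requiring both conditions means a mistaken judge label alone cannot remove reasoning from an example that the non-reasoning model gets wrong.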

Problem Statement

Large reasoning models always generate long step-by-step chains, which improve accuracy on hard problems but waste tokens and latency on easy ones. The paper asks: how to make a single model decide when to skip detailed reasoning and when to reason fully, saving inference cost without hurting accuracy?

Main Contribution

A hybrid training pipeline that equips one reasoning model to switch between fast (answer-only) and slow (CoT) modes automatically.

A small set of human-derived patterns that distinguish essential from redundant reasoning and an LLM-based judge (GPT-4o) to label trajectories at scale.

Key Findings

LRMs produce many more tokens than non-reasoning LLMs on common QA/math tasks.

Numbers: LRMs generate on average 7.32× more tokens than non-reasoning LLMs (Table 1).

Practical Use: There is substantial token waste to target; pruning redundant reasoning can materially reduce inference cost for many examples.

Evidence Ref: Abstract / Table 1

OThink-R1 reduces reasoning tokens while maintaining or improving accuracy on evaluated benchmarks.

Numbers: On OpenBookQA (14B), 421 tokens vs 522 for the base model; accuracy 93.40% vs 92.80% (Table 2).

Practical Use: You can lower token use and keep accuracy by fine-tuning with the hybrid dataset and dual-KL regularizer.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| OpenBookQA (14B) | OThink-R1 tokens 421, ACC 93.40% | DeepSeek-R1 tokens 522, ACC 92.80% | −101 tokens, +0.6 pp accuracy | OpenBookQA | Table 2 shows token and accuracy comparison | Table 2 |
| CommonsenseQA (14B) | OThink-R1 tokens 435, ACC 81.80% | DeepSeek-R1 tokens 569, ACC 81.70% | −134 tokens, +0.1 pp accuracy | CommonsenseQA | Table 2 reports tokens and accuracy | Table 2 |
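The absolute deltas can also be read as relative token savings. The short calculation below uses only the four token counts and accuracies reported for the two 14B rows; the helper name `relative_token_reduction` is introduced here for illustration.

```python
def relative_token_reduction(base_tokens: float, new_tokens: float) -> float:
    """Fraction of the baseline token count that was saved."""
    return (base_tokens - new_tokens) / base_tokens


# (baseline tokens, OThink-R1 tokens, baseline ACC %, OThink-R1 ACC %)
rows = {
    "OpenBookQA (14B)": (522, 421, 92.80, 93.40),
    "CommonsenseQA (14B)": (569, 435, 81.70, 81.80),
}

for name, (base_tok, new_tok, base_acc, new_acc) in rows.items():
    saved = relative_token_reduction(base_tok, new_tok)
    print(f"{name}: {saved:.1%} fewer tokens, {new_acc - base_acc:+.1f} pp accuracy")
```

On these two rows the savings work out to roughly 19% and 24% of the baseline tokens.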

What To Try In 7 Days

Run LRM outputs through an LLM judge on a sample set to measure how often CoT steps are redundant.

Construct a small hybrid SFT set: keep full CoTs for clearly essential cases, prune redundant chains where immediate answers already match.

Fine-tune with a dual KL loss to maintain both reasoning and concise-answer behavior, then measure tokens and accuracy on your top-use cases.
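The last step can be illustrated on a toy next-token distribution. The text above does not give the exact loss, so everything here is an assumption: the form (negative log-likelihood plus two weighted KL terms), the KL direction, the choice of a slow (reasoning) and a fast (non-reasoning) reference, and the weights `beta_slow` and `beta_fast`.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def dual_kl_sft_loss(
    policy: Sequence[float],    # model's next-token distribution
    target_index: int,          # index of the supervised target token
    ref_slow: Sequence[float],  # assumed reference: frozen reasoning model
    ref_fast: Sequence[float],  # assumed reference: frozen non-reasoning model
    beta_slow: float = 0.1,     # assumed weight
    beta_fast: float = 0.1,     # assumed weight
) -> float:
    """SFT negative log-likelihood plus two KL penalties toward the references."""
    nll = -math.log(policy[target_index])
    return (
        nll
        + beta_slow * kl_divergence(policy, ref_slow)
        + beta_fast * kl_divergence(policy, ref_fast)
    )


if __name__ == "__main__":
    policy = [0.5, 0.3, 0.2]
    print(dual_kl_sft_loss(policy, 0, policy, policy))           # penalties vanish
    print(dual_kl_sft_loss(policy, 0, policy, [0.1, 0.1, 0.8]))  # penalty from one reference
```

In a real training loop the same quantities would be computed per token from model logits; the two penalties are what keep the fine-tuned model from drifting entirely toward one generation style.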

Agent Features

Planning: intrinsic mode selection (fast vs slow)
Tool Use: LLM judge (GPT-4o) for labeling
Frameworks: SFT

Optimization Features

Token Efficiency (SFT): fast-thinking mode activated for 8–37% of test cases (varies by dataset/scale)
Model Optimization: fine-tune LRM to emit two generation styles
System Optimization: no extra model parameters required; behavior learned in a single model
Training Optimization (SFT): dual KL-divergence regularizer to anchor distributions
Inference Optimization: dynamic mode selection to skip CoT for many examples
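To check how often the fast mode fires on your own traffic, the share of fast-mode responses can be measured directly from model outputs. The sketch below assumes DeepSeek-R1-style `<think>…</think>` delimiters and treats an empty or missing block as fast mode; neither convention is confirmed by the text above, and the function names are hypothetical.

```python
import re
from typing import Iterable

# Assumption: reasoning is wrapped in <think>...</think>.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def is_fast_mode(output: str) -> bool:
    """True when the response has no reasoning block or an empty one."""
    match = THINK_BLOCK.search(output)
    return match is None or match.group(1).strip() == ""


def fast_mode_rate(outputs: Iterable[str]) -> float:
    """Fraction of responses that skipped chain-of-thought."""
    items = list(outputs)
    if not items:
        return 0.0
    return sum(is_fast_mode(o) for o in items) / len(items)


if __name__ == "__main__":
    samples = [
        "<think></think>The answer is 4.",
        "<think>\nStep 1: ...\nStep 2: ...\n</think>The answer is 17.",
        "<think>\n\n</think>B",
    ]
    print(f"fast-mode rate: {fast_mode_rate(samples):.0%}")
```

Tracking this rate alongside accuracy per dataset shows whether token savings come from easy cases being skipped or from reasoning being dropped where it was needed.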

Risks & Boundaries

Limitations

Relies on an external LLM judge (GPT-4o) to label CoT traces, adding cost and potential judge bias.

Pattern discovery depended on a small panel of senior researchers; patterns may miss edge cases.

When Not To Use

When you cannot afford the judge LLM calls during dataset construction.

When strict determinism is required and any judge-driven pruning risks unpredictable behavior.

Failure Modes

Judge mislabels essential reasoning as redundant, causing accuracy drops on harder examples.

Removing dual-KL constraints leads to mode collapse or runaway overthinking (large token blowup).

Core Entities

Models

DeepSeek-R1-Distill-Qwen-7B/14B, Qwen2.5-Instruct, GPT-4o (judge)

Metrics

Tokens, Accuracy

Datasets

OpenBookQA, CommonsenseQA, ASDIV, GSM8K

Benchmarks

OpenBookQA, CommonsenseQA, ASDIV, GSM8K

Context Entities

Models

Qwen2.5 (non-reasoning baseline), DeepSeek-R1 (reference LRM)