Teach a single reasoning model to switch between fast (answer-only) and slow (chain-of-thought) modes to save tokens without losing accuracy

Overview

Decision Snapshot: Ready For Pilot

The approach is practical: it relies on available LLMs to label trajectories and on standard SFT tooling. Results span four common benchmarks and include targeted ablations that validate each component.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 62%

Authors

Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Zhe Li, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OThink-R1 cuts costly reasoning tokens at inference while keeping accuracy, lowering latency and per-request compute cost for products that use step-by-step reasoning.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

OThink-R1 trains an existing large reasoning model (LRM) to autonomously choose fast or slow thinking. The method (1) extracts patterns that mark essential vs redundant chain-of-thought (CoT) steps, (2) uses an LLM judge (GPT-4o) to label trajectories, (3) builds a hybrid SFT dataset with pruned (fast) and full (slow) examples, and (4) fine-tunes with dual KL constraints to avoid mode collapse. On four benchmarks (OpenBookQA, CommonsenseQA, ASDIV, GSM8K) it substantially cuts reasoning tokens while keeping or improving accuracy versus the base LRM.
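Steps (2) and (3) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Trajectory` fields, the `build_hybrid_sft` helper, and the rule of skipping wrong-answer trajectories are assumptions, and `toy_judge` is a hypothetical stand-in for the GPT-4o judge call. The empty `<think>` block used for fast examples is also an assumed output format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    question: str
    reasoning: str  # chain-of-thought emitted by the LRM
    answer: str     # final answer emitted by the LRM
    gold: str       # reference answer


def build_hybrid_sft(
    trajectories: List[Trajectory],
    is_redundant: Callable[[Trajectory], bool],
) -> List[Dict[str, str]]:
    """Turn judged trajectories into fast (pruned) and slow (full) SFT examples."""
    examples = []
    for t in trajectories:
        if t.answer.strip() != t.gold.strip():
            continue  # assumption: skip trajectories with a wrong final answer
        if is_redundant(t):
            # Fast example: reasoning removed, answer kept (assumed format).
            completion, mode = f"<think>\n</think>\n{t.answer}", "fast"
        else:
            # Slow example: full chain-of-thought kept.
            completion, mode = f"<think>\n{t.reasoning}\n</think>\n{t.answer}", "slow"
        examples.append({"prompt": t.question, "completion": completion, "mode": mode})
    return examples


def toy_judge(t: Trajectory) -> bool:
    """Hypothetical placeholder for the LLM judge: calls short reasoning redundant."""
    return len(t.reasoning.split()) < 6


data = [
    Trajectory("What is 2 + 2?", "2 plus 2 is 4.", "4", "4"),
    Trajectory("Harder question", "step one, then step two, then step three, so 17", "17", "17"),
    Trajectory("Mismatched", "a guess", "5", "6"),
]
hybrid = build_hybrid_sft(data, toy_judge)
print([ex["mode"] for ex in hybrid])
```

In a real pipeline the judge callable would wrap an API call and apply the essential-vs-redundant patterns; keeping it as a plain function argument makes it easy to swap judges or audit labels.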

Problem Statement

Large reasoning models generate long step-by-step chains for every input. These chains improve accuracy on hard problems but waste tokens and add latency on easy ones. The paper asks how a single model can decide when to skip detailed reasoning and when to reason fully, saving inference cost without hurting accuracy.

Main Contribution

A hybrid training pipeline that equips one reasoning model to switch between fast (answer-only) and slow (CoT) modes automatically.

A small set of human-derived patterns that distinguish essential from redundant reasoning and an LLM-based judge (GPT-4o) to label trajectories at scale.

Key Findings

LRMs produce many more tokens than non-reasoning LLMs on common QA/math tasks.

Numbers: LRMs generate on average 7.32× more tokens than non-reasoning LLMs (Table 1).

Practical Use: The token waste is large enough to target; pruning redundant reasoning can materially reduce inference cost for many examples.

Evidence Ref: Abstract / Table 1
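To check a similar ratio on your own traffic, one rough approach is to compare mean output lengths from a reasoning model and a non-reasoning model on the same prompts. The sketch below is an assumption-laden illustration: `token_ratio` is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and the ratio-of-means definition may differ from how the paper computed its figure.

```python
from typing import Callable, Sequence


def mean_tokens(outputs: Sequence[str], count: Callable[[str], int]) -> float:
    """Average token count over a batch of model outputs."""
    return sum(count(o) for o in outputs) / len(outputs)


def token_ratio(
    lrm_outputs: Sequence[str],
    llm_outputs: Sequence[str],
    count: Callable[[str], int] = lambda s: len(s.split()),  # stand-in tokenizer
) -> float:
    """Ratio of mean LRM output length to mean non-reasoning LLM output length."""
    return mean_tokens(lrm_outputs, count) / mean_tokens(llm_outputs, count)


# Toy outputs, invented purely for illustration.
lrm = ["first add two and two to get four", "so the answer is B"]
llm = ["four", "the answer B"]
ratio = token_ratio(lrm, llm)
```

Swapping `count` for the tokenizer your serving stack bills against gives a number that maps more directly onto cost.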

OThink-R1 reduces reasoning tokens while maintaining or improving accuracy on evaluated benchmarks.

Numbers: OpenBookQA 14B: tokens 421 vs 522 (base); accuracy 93.40% vs 92.80% (Table 2).

Practical Use: Fine-tuning with the hybrid dataset and the dual-KL regularizer can lower token use while keeping accuracy.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
OpenBookQA (14B)	OThink-R1 tokens 421, ACC 93.40%	DeepSeek-R1 tokens 522, ACC 92.80%	−101 tokens, +0.6 pp accuracy	OpenBookQA	Table 2 shows token and accuracy comparison	Table 2
CommonsenseQA (14B)	OThink-R1 tokens 435, ACC 81.80%	DeepSeek-R1 tokens 569, ACC 81.70%	−134 tokens, +0.1 pp accuracy	CommonsenseQA	Table 2 reports tokens and accuracy	Table 2
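The deltas in the table can be re-derived, and turned into relative savings, with a few lines of arithmetic. The figures below are copied from the two rows above; the `deltas` helper is only an illustrative name.

```python
# Token and accuracy figures copied from the Results rows (Table 2).
rows = {
    "OpenBookQA (14B)": {"tokens": 421, "base_tokens": 522, "acc": 93.40, "base_acc": 92.80},
    "CommonsenseQA (14B)": {"tokens": 435, "base_tokens": 569, "acc": 81.80, "base_acc": 81.70},
}


def deltas(row):
    """Return (absolute token delta, relative token delta in %, accuracy delta in pp)."""
    token_delta = row["tokens"] - row["base_tokens"]
    token_pct = 100.0 * token_delta / row["base_tokens"]
    acc_pp = round(row["acc"] - row["base_acc"], 2)
    return token_delta, token_pct, acc_pp


summary = {name: deltas(row) for name, row in rows.items()}
for name, (tok, pct, pp) in summary.items():
    print(f"{name}: {tok} tokens ({pct:.1f}%), {pp:+.1f} pp accuracy")
```

On these two rows the relative token reduction works out to roughly 19% and 24%, which is the quantity that scales with per-request cost.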

What To Try In 7 Days

Run LRM outputs through an LLM judge on a sample set to measure how often CoT steps are redundant.

Construct a small hybrid SFT set: keep full CoTs for clearly essential cases, and prune redundant chains where the immediate (answer-only) response already matches the final answer.

Fine-tune with a dual KL loss to maintain both reasoning and concise-answer behavior, then measure tokens and accuracy on your top-use cases.
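For the third step, the shape of a dual-KL objective can be sketched on toy distributions. This is a hedged illustration, not the paper's formulation: the choice of two reference distributions (`ref_slow` for the original reasoning model, `ref_fast` for a concise-answer model), the KL direction, and the `beta_*` weights are all assumptions, and a real implementation would work on per-token logits in a training framework.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def dual_kl_loss(
    policy: Sequence[float],    # model's next-token distribution
    target_index: int,          # index of the SFT target token
    ref_slow: Sequence[float],  # assumed reference: original reasoning model
    ref_fast: Sequence[float],  # assumed reference: concise-answer model
    beta_slow: float = 0.1,     # hypothetical weights
    beta_fast: float = 0.1,
) -> float:
    """SFT negative log-likelihood plus two KL anchors toward the references."""
    nll = -math.log(policy[target_index])
    return nll + beta_slow * kl(policy, ref_slow) + beta_fast * kl(policy, ref_fast)


policy = [0.7, 0.2, 0.1]
anchored = dual_kl_loss(policy, 0, policy, policy)            # both KL terms vanish
drifted = dual_kl_loss(policy, 0, [0.1, 0.2, 0.7], policy)    # penalized for drift
```

The intent of the two anchors is that drifting away from either reference raises the loss, which is one way to discourage the model from collapsing into a single mode.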

Agent Features

Planning

intrinsic mode selection (fast vs slow)

Tool Use

LLM judge (GPT-4o) for labeling

Frameworks

SFT

Optimization Features

Token Efficiency

SFT: fast-thinking mode activated for 8–37% of test cases (varies by dataset/scale)

Model Optimization

fine-tune LRM to emit two generation styles

System Optimization

no extra model parameters required; behavior learned in single model

Training Optimization

SFT: dual KL-divergence regularizer to anchor distributions

Inference Optimization

dynamic mode selection to skip CoT for many examples
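To measure how often a fine-tuned model actually skips CoT, generated outputs can be classified after the fact. The sketch below assumes that reasoning is wrapped in `<think>...</think>` tags and that a fast-mode response leaves that block empty or omits it; `is_fast` and `fast_rate` are hypothetical helper names.

```python
import re
from typing import Sequence

# Captures whatever sits between the first pair of think tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def is_fast(output: str) -> bool:
    """True if the output has no reasoning block or an empty one (assumed format)."""
    match = THINK_RE.search(output)
    return match is None or match.group(1).strip() == ""


def fast_rate(outputs: Sequence[str]) -> float:
    """Fraction of outputs produced in fast (answer-only) mode."""
    return sum(is_fast(o) for o in outputs) / len(outputs)


samples = [
    "<think>\n</think>\n4",
    "<think>\nstep 1, then step 2\n</think>\n17",
    "42",
    "<think></think>B",
]
rate = fast_rate(samples)
```

Tracking this rate per dataset alongside accuracy shows whether token savings come from easy cases or from reasoning being skipped where it was needed.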

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AgenticIR-Lab/OThink-R1

Data URLs

https://huggingface.co/datasets/tau/commonsense_qa

https://github.com/allenai/OpenBookQA

https://huggingface.co/datasets/EleutherAI/asdiv

https://huggingface.co/datasets/openai/gsm8k

Risks & Boundaries

Limitations

Relies on an external LLM judge (GPT-4o) to label CoT traces, adding cost and potential judge bias.

Pattern discovery depended on a small panel of senior researchers; patterns may miss edge cases.

When Not To Use

When you cannot afford the judge LLM calls during dataset construction.

When strict determinism is required and any judge-driven pruning risks unpredictable behavior.

Failure Modes

Judge mislabels essential reasoning as redundant, causing accuracy drops on harder examples.

Removing dual-KL constraints leads to mode collapse or runaway overthinking (large token blowup).

Core Entities

Models

DeepSeek-R1-Distill-Qwen-7B/14B, Qwen2.5-Instruct, GPT-4o (judge)

Metrics

Tokens, Accuracy

Datasets

OpenBookQA, CommonsenseQA, ASDIV, GSM8K

Benchmarks

OpenBookQA, CommonsenseQA, ASDIV, GSM8K

Context Entities

Models

Qwen2.5 (non-reasoning baseline), DeepSeek-R1 (reference LRM)
