Overview
Production Readiness
0.6
Novelty Score
0.62
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
OThink-R1 cuts costly reasoning tokens at inference while keeping accuracy, lowering latency and per-request compute cost for products that use step-by-step reasoning.
Summary TLDR
OThink-R1 trains an existing large reasoning model (LRM) to autonomously choose fast or slow thinking. The method (1) extracts patterns that mark essential vs redundant chain-of-thought (CoT) steps, (2) uses an LLM judge (GPT-4o) to label trajectories, (3) builds a hybrid SFT dataset with pruned (fast) and full (slow) examples, and (4) fine-tunes with dual KL constraints to avoid mode collapse. On four benchmarks (OpenBookQA, CommonsenseQA, ASDIV, GSM8K) it substantially cuts reasoning tokens while keeping or improving accuracy versus the base LRM.
Problem Statement
Large reasoning models always generate long step-by-step chains, which improve accuracy on hard problems but waste tokens and add latency on easy ones. The paper asks how a single model can decide when to skip detailed reasoning and when to reason fully, saving inference cost without hurting accuracy.
Main Contribution
A hybrid training pipeline that equips one reasoning model to switch between fast (answer-only) and slow (CoT) modes automatically.
A small set of human-derived patterns that distinguish essential from redundant reasoning and an LLM-based judge (GPT-4o) to label trajectories at scale.
A dual-KL fine-tuning objective that preserves both reasoning ability and efficient generation and prevents mode collapse.
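The dual-KL objective can be illustrated on a single token position. This is a minimal sketch, not the paper's implementation: the two reference distributions (one "slow" reasoning reference, one "fast" concise reference), the KL direction, and the weights `beta_slow` and `beta_fast` are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def dual_kl_loss(policy_logits, slow_ref_logits, fast_ref_logits, target_id,
                 beta_slow=0.1, beta_fast=0.1):
    """SFT cross-entropy plus two KL penalties (hypothetical formulation).

    The penalties anchor the fine-tuned policy to a reasoning reference and
    to a concise-answer reference, so neither style is forgotten.
    """
    p = softmax(policy_logits)
    nll = -math.log(p[target_id])              # standard SFT term
    kl_slow = kl(p, softmax(slow_ref_logits))  # stay close to the reasoning reference
    kl_fast = kl(p, softmax(fast_ref_logits))  # stay close to the concise reference
    return nll + beta_slow * kl_slow + beta_fast * kl_fast
```

In a real training loop the same sum would be averaged over all token positions and computed with a tensor library; setting either weight to zero removes that anchor, which is the ablation the summary associates with mode collapse or overthinking.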
Key Findings
LRMs produce many more tokens than non-reasoning LLMs on common QA/math tasks.
OThink-R1 reduces reasoning tokens while maintaining or improving accuracy on evaluated benchmarks.
A large fraction of training CoT steps were judged redundant and pruned during dataset construction.
Dual KL constraints are critical to avoid overthinking or loss of performance.
Results
OpenBookQA (14B)
CommonsenseQA (14B)
GSM8K (7B)
Prune ratio (training) / Fast-thinking ratio (test)
Who Should Care
What To Try In 7 Days
Run LRM outputs through an LLM judge on a sample set to measure how often CoT steps are redundant.
Construct a small hybrid SFT set: keep full CoTs where the reasoning is clearly essential, and prune chains judged redundant where a direct answer is already correct.
Fine-tune with a dual KL loss to maintain both reasoning and concise-answer behavior, then measure tokens and accuracy on your top use cases.
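The first two steps above can be sketched as a small dataset builder. The `Trajectory` fields, the `<think>` tag layout, and the choice to drop incorrect trajectories are assumptions for illustration; `judged_redundant` stands in for whatever verdict your LLM judge returns.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    question: str
    cot: str                 # full chain-of-thought emitted by the LRM
    answer: str
    lrm_correct: bool        # did the LRM reach the reference answer?
    judged_redundant: bool   # hypothetical field: verdict from the LLM judge

def build_hybrid_sft(trajectories):
    """Return (examples, prune_ratio) for a hybrid fast/slow SFT set."""
    examples = []
    pruned = 0
    for t in trajectories:
        if not t.lrm_correct:
            continue  # assumption: only correct trajectories are used for training
        prune = t.judged_redundant
        pruned += int(prune)
        think = "" if prune else t.cot.strip()
        examples.append({
            "prompt": t.question,
            # assumed output layout: reasoning inside <think> tags, then the answer
            "completion": f"<think>\n{think}\n</think>\n\n{t.answer}",
            "mode": "fast" if prune else "slow",
        })
    prune_ratio = pruned / len(examples) if examples else 0.0
    return examples, prune_ratio
```

The returned `prune_ratio` is the share of kept trajectories whose reasoning was removed, which gives a quick read on how much redundancy the judge finds in your own data before any fine-tuning is run.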
Agent Features
Planning
- intrinsic mode selection (fast vs slow)
Tool Use
- LLM judge (GPT-4o) for labeling
Frameworks
- SFT
Optimization Features
Token Efficiency
- SFT
- fast-thinking mode activated for 8–37% of test cases (varies by dataset/scale)
Model Optimization
- fine-tune LRM to emit two generation styles
System Optimization
- no extra model parameters required; behavior learned in a single model
Training Optimization
- SFT
- dual KL-divergence regularizer to anchor distributions
Inference Optimization
- dynamic mode selection to skip CoT for many examples
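To check how often a fine-tuned model actually skips CoT, the outputs can be summarized as below. This is a rough sketch under stated assumptions: reasoning is assumed to sit inside `<think>...</think>` tags, an empty block is treated as fast mode, answers are compared by exact string match, and whitespace splitting stands in for a real tokenizer count.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_output(text):
    """Split a generation into (reasoning, answer); reasoning is '' if absent."""
    m = THINK_RE.search(text)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

def summarize(outputs, references):
    """Report fast-thinking ratio, accuracy, and mean output length."""
    if not outputs:
        raise ValueError("no outputs to summarize")
    fast = correct = tokens = 0
    for out, ref in zip(outputs, references):
        reasoning, answer = split_output(out)
        fast += int(reasoning == "")   # empty reasoning counts as fast mode
        correct += int(answer == ref)  # exact match; swap in your own scorer
        tokens += len(out.split())     # whitespace proxy for token count
    n = len(outputs)
    return {
        "fast_ratio": fast / n,
        "accuracy": correct / n,
        "mean_tokens": tokens / n,
    }
```

Running the same summary on the base LRM and on the fine-tuned model gives a side-by-side view of token savings against any accuracy change.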
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on an external LLM judge (GPT-4o) to label CoT traces, adding cost and potential judge bias.
- Pattern discovery depended on a small panel of senior researchers; patterns may miss edge cases.
- Evaluations limited to four QA/math datasets; different domains may need new pattern definitions.
When Not To Use
- When you cannot afford the judge LLM calls during dataset construction.
- When strict determinism is required, since judge-driven pruning can introduce unpredictable behavior.
- If your use case needs full CoT for auditing or legal traceability on every example.
Failure Modes
- Judge mislabels essential reasoning as redundant, causing accuracy drops on harder examples.
- Removing dual-KL constraints leads to mode collapse or runaway overthinking (large token blowup).
- Patterns identified by experts may not generalize, reducing pruning safety on new datasets.
Core Entities
Models
- DeepSeek-R1-Distill-Qwen-7B/14B
- Qwen2.5-Instruct
- GPT-4o (judge)
Metrics
- Tokens
- Accuracy
Datasets
- OpenBookQA
- CommonsenseQA
- ASDIV
- GSM8K
Benchmarks
- OpenBookQA
- CommonsenseQA
- ASDIV
- GSM8K
Context Entities
Models
- Qwen2.5 (non-reasoning baseline)
- DeepSeek-R1 (reference LRM)

