Overview
The approach is practical: it relies on available LLMs to label trajectories and standard SFT tooling. Results span four common benchmarks and include targeted ablations validating each component.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 62%
Why It Matters For Business
OThink-R1 cuts costly reasoning tokens at inference while keeping accuracy, lowering latency and per-request compute cost for products that use step-by-step reasoning.
Who Should Care
Summary TLDR
OThink-R1 trains an existing large reasoning model (LRM) to autonomously choose fast or slow thinking. The method (1) extracts patterns that mark essential vs redundant chain-of-thought (CoT) steps, (2) uses an LLM judge (GPT-4o) to label trajectories, (3) builds a hybrid SFT dataset with pruned (fast) and full (slow) examples, and (4) fine-tunes with dual KL constraints to avoid mode collapse. On four benchmarks (OpenBookQA, CommonsenseQA, ASDIV, GSM8K) it substantially cuts reasoning tokens while keeping or improving accuracy versus the base LRM.
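Step (3) of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `build_hybrid_sft`, the record fields, and the boolean `redundant` flag (standing in for the judge's label from step 2) are all hypothetical.

```python
def build_hybrid_sft(examples):
    """Build a hybrid SFT set from judged trajectories.

    Each example is a dict with hypothetical keys:
      'question'  - the prompt
      'cot'       - the full chain-of-thought text
      'answer'    - the final answer
      'redundant' - bool, assumed to come from the LLM judge
    """
    dataset = []
    for ex in examples:
        if ex["redundant"]:
            # Fast mode: the reasoning was judged redundant, so keep only the answer.
            mode, target = "fast", ex["answer"]
        else:
            # Slow mode: the reasoning was judged essential, so keep the full trace.
            mode, target = "slow", ex["cot"] + "\n" + ex["answer"]
        dataset.append({"prompt": ex["question"], "target": target, "mode": mode})
    return dataset


samples = [
    {"question": "2 + 2 = ?", "cot": "Add 2 and 2.", "answer": "4", "redundant": True},
    {"question": "Harder problem", "cot": "Step 1. Step 2.", "answer": "42", "redundant": False},
]
records = build_hybrid_sft(samples)
```

Keeping both kinds of target in one dataset is what lets a single model learn both behaviors; the exact target format used in the paper is not specified here.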
Problem Statement
Large reasoning models always generate long step-by-step chains, which improve accuracy on hard problems but waste tokens and add latency on easy ones. The paper asks: how can a single model decide when to skip detailed reasoning and when to reason fully, saving inference cost without hurting accuracy?
Main Contribution
A hybrid training pipeline that equips one reasoning model to switch between fast (answer-only) and slow (CoT) modes automatically.
A small set of human-derived patterns that distinguish essential from redundant reasoning and an LLM-based judge (GPT-4o) to label trajectories at scale.
Key Findings
LRMs produce many more tokens than non-reasoning LLMs on common QA/math tasks.
OThink-R1 reduces reasoning tokens while maintaining or improving accuracy on evaluated benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| OpenBookQA (14B) | OThink-R1 tokens 421, ACC 93.40% | DeepSeek-R1 tokens 522, ACC 92.80% | −101 tokens, +0.6 pp accuracy | OpenBookQA | Table 2 shows token and accuracy comparison | Table 2 |
| CommonsenseQA (14B) | OThink-R1 tokens 435, ACC 81.80% | DeepSeek-R1 tokens 569, ACC 81.70% | −134 tokens, +0.1 pp accuracy | CommonsenseQA | Table 2 reports tokens and accuracy | Table 2 |
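
The deltas in the table can be restated as relative savings, which is often the more useful number for cost estimates. The short sketch below recomputes them from the token counts in the two rows above; the helper name `token_savings` is illustrative only.

```python
def token_savings(baseline_tokens, new_tokens):
    """Return (absolute tokens saved, percentage reduction vs. baseline)."""
    saved = baseline_tokens - new_tokens
    return saved, 100.0 * saved / baseline_tokens


# (baseline tokens, OThink-R1 tokens) as reported in the table rows above
rows = {
    "OpenBookQA (14B)": (522, 421),
    "CommonsenseQA (14B)": (569, 435),
}

for name, (base, new) in rows.items():
    saved, pct = token_savings(base, new)
    print(f"{name}: {saved} fewer tokens ({pct:.1f}% reduction)")
```

On these two rows the reduction works out to roughly 19% and 24% of the baseline reasoning tokens.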
What To Try In 7 Days
Run LRM outputs through an LLM judge on a sample set to measure how often CoT steps are redundant.
Construct a small hybrid SFT set: keep full CoTs for clearly essential cases, prune redundant chains where immediate answers already match.
Fine-tune with a dual KL loss to maintain both reasoning and concise-answer behavior, then measure tokens and accuracy on your top-use cases.
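To make the last step concrete, here is a toy, single-token sketch of what a dual KL objective could look like. It assumes two reference distributions (one from a slow-thinking model, one from a fast-answering model), a KL direction of model-to-reference, and weights `beta_slow` and `beta_fast`; none of these details are given in this summary, so treat them as assumptions rather than the paper's exact formulation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def dual_kl_loss(model_probs, target_index, slow_ref_probs, fast_ref_probs,
                 beta_slow=0.1, beta_fast=0.1):
    """SFT negative log-likelihood plus two KL penalties (assumed form).

    model_probs    - the fine-tuned model's next-token distribution
    target_index   - index of the supervised target token
    slow_ref_probs - reference distribution from a reasoning model (assumption)
    fast_ref_probs - reference distribution from a concise-answer model (assumption)
    """
    nll = -math.log(model_probs[target_index] + 1e-12)
    kl_slow = kl_divergence(model_probs, slow_ref_probs)
    kl_fast = kl_divergence(model_probs, fast_ref_probs)
    return nll + beta_slow * kl_slow + beta_fast * kl_fast


# When the model matches both references, only the NLL term remains.
loss_matched = dual_kl_loss([0.5, 0.5], 0, [0.5, 0.5], [0.5, 0.5])
# Drifting away from the references adds a positive penalty.
loss_drifted = dual_kl_loss([0.9, 0.1], 0, [0.5, 0.5], [0.5, 0.5])
```

In practice the same idea would be applied per token over full sequences inside a standard SFT training loop; the two penalties are what keep the model from collapsing into always-fast or always-slow behavior.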
Agent Features
Planning
Tool Use
Frameworks
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Relies on an external LLM judge (GPT-4o) to label CoT traces, adding cost and potential judge bias.
Pattern discovery depended on a small panel of senior researchers; patterns may miss edge cases.
When Not To Use
When you cannot afford the judge LLM calls during dataset construction.
When strict determinism is required and any judge-driven pruning risks unpredictable behavior.
Failure Modes
Judge mislabels essential reasoning as redundant, causing accuracy drops on harder examples.
Removing dual-KL constraints leads to mode collapse or runaway overthinking (large token blowup).

