Chain-of-Thought Papers — Parsed & Scored for Practitioners

Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

0.60

0.45

96

You can get near-state-of-the-art multimodal reasoning with lightweight models by fine-tuning in two stages and fusing image features—this reduces hallucination and lowers compute cost versus running large multimodal LLMs.

Key finding

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

0.50

0.60

0.40

85

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Key finding

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on their test set.

Numbers: Arithmetic: 67.0% → 81.8% (Table 1)
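
The debate setup described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Agent` signature, the `debate` function, the number of rounds, and the final majority vote are all assumptions made for the example, and each agent would wrap a real LLM call in practice.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent signature (an assumption for this sketch):
# (question, other agents' latest answers) -> this agent's answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a fixed number of debate rounds and return the most common answer."""
    # Round 0: every agent answers independently, seeing no peer answers.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and may revise its own.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Aggregate by simple majority (an assumption; other aggregation rules are possible).
    return Counter(answers).most_common(1)[0][0]
```

Cost grows with both the number of agents and the number of rounds, since every round makes one call per agent. That is the latency trade-off the summary mentions.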

Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

0.40

0.30

0.40

19

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), current models vary widely; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Key finding

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

Make LLMs think in program structures to improve code generation

0.70

0.60

0.50

17

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Key finding

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29 → SCoT 60.64, +7.35pp absolute)
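
The two figures on the Numbers line are related by simple arithmetic: the percentage is a relative gain over the CoT score, while the difference between the two scores is the absolute gain in points. A small sketch that makes the distinction explicit (the helper name `gains` is made up for this example):

```python
def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after - before
    relative_pct = 100 * (after - before) / before
    return round(absolute_pp, 2), round(relative_pct, 2)


# Pass@1 scores quoted above: CoT 53.29, SCoT 60.64.
print(gains(53.29, 60.64))  # (7.35, 13.79)
```

The same check is useful for other entries in this list, which mix relative percentages and percentage points.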

An open, continuously updated leaderboard that measures LLM multi-step reasoning using chain-of-thought prompts

0.60

0.40

0.60

15

Reasoning capability separates good conversational models from ones that can solve multi-step tasks; measuring it helps pick models for products that need math, code, or multi-step decisions.

Key finding

Reasoning performance scales with model size.

Numbers: GSM8k: GPT-4 92.0 vs LLaMA-65B 50.9

MindMap: prompt LLMs with knowledge-graph evidence to produce explicit graph-style reasoning and reduce hallucination

0.60

0.70

0.50

11

MindMap makes LLM outputs more factual and inspectable by forcing the model to reason over KG-derived evidence graphs; this reduces hallucination risk in knowledge-heavy apps like medical assistants and increases trust.

Key finding

MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared to GPT-3.5 and other retrievers.

Numbers: BERTScore F1: MindMap 0.7954 vs GPT-3.5 0.7800 (Table 2)

Use simple logic checks to make zero‑shot chain-of-thought answers more reliable

0.60

0.40

0.30

11

LoT is a low‑effort prompting add‑on that raises reasoning accuracy on strong LLMs; use it when correctness matters and you can afford extra API calls.

Key finding

Adpt‑LoT improves zero‑shot CoT accuracy on math and reasoning tasks for strong models.

Numbers: GSM8K: 78.75 → 80.15 (+1.40pp); AQuA: 57.09 → 60.63 (+3.54pp)

TextStarCraft II: a text-based StarCraft II benchmark and a Chain-of-Summarization (CoS) method that helps LLMs plan in real time

0.40

0.60

0.50

10

TextStarCraft II and CoS show that LLMs can handle high-level, time-sensitive strategy where visual micro-control is scripted; this enables low-cost experimentation with strategic agents and rapid prototyping of language-driven decision systems.

Key finding

Closed-source LLMs using full CoS beat the level-5 built-in AI in many trials.

Numbers: GPT-4: 12/20 wins, GPT-3.5: 11/20 wins (Table 1)

Use an LLM to judge each reasoning step and guide stochastic beam search to reduce error accumulation

0.60

0.40

10

If your product relies on long multi-step reasoning (math QA, multi-hop extraction, program generation), adding a step-level LLM evaluator can raise correctness without model fine-tuning; expect higher compute cost and a need for token-level access.

Key finding

Self-evaluation-guided decoding improves few-shot accuracy on major benchmarks with a Codex backbone.

Numbers: GSM8K +6.34%; AQuA +9.56%; StrategyQA +5.46%
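
A rough sketch of step-level evaluation steering a stochastic beam search is shown below. It is a simplification under stated assumptions, not the paper's method: `propose` and `evaluate` are hypothetical callables standing in for LLM calls, beams are scored by the evaluator's confidence alone, and beams are sampled with replacement.

```python
import math
import random
from typing import Callable, List, Tuple


def stochastic_beam_search(
    propose: Callable[[List[str]], List[str]],    # hypothetical: partial chain -> candidate next steps
    evaluate: Callable[[List[str], str], float],  # hypothetical: confidence in (0, 1] for one step
    beam_size: int = 2,
    max_steps: int = 3,
    temperature: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Return the highest-scoring reasoning chain found by sampled beam search."""
    rng = random.Random(seed)
    # Each beam is (chain of steps, cumulative log-confidence).
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for chain, score in beams:
            for step in propose(chain):
                confidence = max(evaluate(chain, step), 1e-12)  # guard against log(0)
                candidates.append((chain + [step], score + math.log(confidence)))
        if not candidates:
            break  # no chain can be extended further
        # Sample the next beams with probability proportional to exp(score / temperature),
        # so weaker steps are still explored occasionally instead of being pruned outright.
        weights = [math.exp(score / temperature) for _, score in candidates]
        beams = rng.choices(candidates, weights=weights, k=min(beam_size, len(candidates)))
    return max(beams, key=lambda beam: beam[1])[0]
```

Lower `temperature` values make the search behave more like ordinary greedy beam search; higher values add exploration. Every candidate step costs one evaluator call, which is where the extra compute comes from.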

Have LLMs 'think about their thinking' to boost understanding on NLU tasks

0.60

0.70

0.30

9

Metacognitive Prompting is a low‑cost way to improve model understanding on domain text (law, medicine) without retraining; expect modest average gains and larger wins on specialized datasets.

Key finding

MP gives a consistent aggregate performance uplift over CoT in zero‑shot settings.

Numbers: Relative boost 4.8%–6.4% vs CoT (zero‑shot, averaged across models)

Survey of practical methods to improve reasoning in large language models

0.60

0.40

0.50

8

Better reasoning reduces wrong conclusions, lowers downstream verification cost, and enables LLMs to be used in higher-stakes workflows like finance, legal, and scientific support.

Key finding

Chain-of-Thought prompting helps multi-step problems by making the model emit intermediate steps.

XLT: a short, language-independent prompt template that boosts non‑English LLM performance

0.75

0.45

0.85

8

XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.

Key finding

XLT substantially improves arithmetic reasoning (MGSM) in zero-shot.

Numbers: text-davinci-003 MGSM zero-shot: 12.5 → 23.9 (+11.4)

AlphaFin dataset + Stock-Chain: a RAG-enabled LLM system for stock prediction and financial Q&A

0.50

0.40

0.60

7

Combining a domain-tuned LLM with retrieval of up-to-date reports and news can improve decision-support outputs and backtested portfolio returns compared to off-the-shelf models on this dataset.

Key finding

Stock-Chain achieved substantially higher backtested annualized return than baselines.

Numbers: ARR 30.8% for Stock-Chain vs 17.5% for FinGPT

How LLMs are being used to build game-playing agents: memory, reasoning, perception, and multi-agent design

0.40

0.60

0.50

6

Game agents are a practical lab for building interactive AI: solutions for memory, robust reasoning, and hybrid control transfer to real automation, simulations, and multi-agent coordination systems used in product testing and virtual worlds.

Key finding

Carrying the previous step's thought into the next prompt (LastThoughts) raises win rate and cuts short-term inconsistent actions.

Numbers: Win rate 0.4217 → 0.4667; consecutive switch rate 0.2442 → 0.0861
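
The idea of carrying the previous step's thought into the next prompt can be illustrated with a short loop. This is only a sketch: the `Policy` signature, the prompt wording, and the function names are assumptions for the example and are not taken from the surveyed work.

```python
from typing import Callable, List, Tuple

# Hypothetical policy: prompt -> (thought, action). In practice this wraps an LLM call.
Policy = Callable[[str], Tuple[str, str]]


def build_prompt(observation: str, last_thought: str) -> str:
    """Compose a prompt that includes the thought from the previous step, if any."""
    previous = last_thought if last_thought else "(none)"
    return (
        f"Previous thought: {previous}\n"
        f"Current observation: {observation}\n"
        "Give a new thought, then an action."
    )


def run_episode(policy: Policy, observations: List[str]) -> List[str]:
    """Run one episode, feeding each step's thought into the next step's prompt."""
    last_thought = ""
    actions = []
    for observation in observations:
        thought, action = policy(build_prompt(observation, last_thought))
        actions.append(action)
        last_thought = thought  # carried into the next step's prompt
    return actions
```

Keeping only the most recent thought keeps the prompt short while still giving the model a reason to stay consistent with its previous decision.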

Survey of how LLMs reason strategically in multi-agent games, economics, and social simulations

0.40

0.30

6

LLM-driven agents can model multi-party dynamics (negotiations, markets, simulations) and improve decision-making, but measurement and domain alignment matter more than raw model size.

Key finding

LLM strategic work spans four scenario families: societal, economic, game-theory, and gaming.

Numbers: 4 scenario categories

Aloe: open 7B–8B medical LLMs using synthetic Chain-of-Thought, model merging and Direct Preference Optimization

0.60

0.50

0.60

6

Aloe shows practical, low‑cost ways to push open medical LLMs: generate CoT examples with a stronger model, merge fine‑tuned variants, and use retrieval‑style prompting to get 2–7 point accuracy gains without larger models or expensive pretraining.

Key finding

Aloe's aligned 8B variant outperforms Llama‑3‑8B‑Instruct across medical benchmarks at this size.

Numbers: Zero‑shot avg: 70.25 vs 68.89 (Llama‑3‑8B; Table 3)

A unified survey that frames reasoning and hallucination as internal consistency problems and presents a Self-Feedback framework

0.60

6

Self-Feedback techniques can reduce contradictory or hallucinated outputs without large model scale-ups, improving reliability for customer-facing QA and code assistants.

Key finding

Self-consistency-style sampling plus majority voting can raise reasoning accuracy on math benchmarks.

Numbers: GSM8K accuracy up ≈ 17.9%
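
Sampling several reasoning paths and taking a majority vote over their final answers can be sketched in a few lines. The `sample` callable is a hypothetical stand-in for one temperature-sampled LLM call that returns only the final answer; the function name and the returned agreement share are choices made for this example.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[str], str],  # hypothetical: question -> final answer of one sampled chain
    question: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Return the most frequent answer and the share of samples that agree with it."""
    answers = [sample(question) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples
```

The agreement share is a cheap confidence signal: a low value suggests the samples disagree and the answer may deserve a second look.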

EvEval benchmark shows LLMs know single events but struggle with event similarity, temporality, and script prediction

0.40

0.50

0.30

6

If your product relies on event reasoning (timelines, forecasting, causal diagnosis), off-the-shelf LLMs can detect plausible single events and causal intent but will likely fail on timeline accuracy, counterfactual edits, and script forecasting. Test with EvEval before deployment.

Key finding

LLMs learn single-event plausibility well but struggle to judge semantic similarity between events.

Numbers: ChatGPT: DTFit 91.43% vs HardExt 65.44%

ReAct's gains come from example-task similarity, not true stepwise reasoning

0.30

0.40

0.20

5

If you use ReAct-style prompts to power agentic workflows, expect brittle behavior: gains often come from near-identical examples, not true planning, which limits scalability and reliability.

Key finding

Interleaving reasoning with actions is not necessary for better performance.

Numbers: GPT-3.5-Turbo: 27.6% → 46.6%; GPT-3.5-Instruct: 44.7% → 61.9% (Base → Exemplar-CoT)

SEA-CoT: pick self-entailment aligned chain-of-thoughts to make explanations more faithful, robust and useful

0.60

0.50

0.40

5

Better explanation selection improves trust and makes model outputs more useful for training smaller systems and auditing model decisions.

Key finding

SEA-CoT wins on aggregate interpretability across prompts and datasets.

Numbers: SEA-CoT >75% aggregate improvement on OBQA vs baselines

Make LLM reasoning cite short knowledge triples and verify them to reduce hallucination

0.60

0.55

0.30

4

CoK makes LLM outputs more checkable and reduces hallucination by forcing explicit evidence and automated verification, which improves trust for QA and decision tasks using LLM APIs.

Key finding

CoK improves reasoning accuracy over CoT on commonsense benchmarks.

Numbers: CSQA +2.8pp (76.5→79.3) gpt-3.5-turbo (Table 1)

900-question spatial benchmark finds gpt-4o leads; Chain-of-Thought and one-shot prompts can sharply boost performance

0.60

0.40

0.60

4

Model choice and prompt style strongly change outcomes on spatial tasks; careful selection plus prompt tuning turns unusable answers into operable outputs.

Key finding

gpt-4o leads in zero-shot overall accuracy across the 900-question benchmark.

Numbers: gpt-4o WA = 71.3% (Table 1)

CausalBench: a 15‑dataset benchmark to measure LLM causal learning from correlation to full causal graphs

0.40

0.30

0.50

4

If you plan causal discovery at realistic scale, don't rely solely on LLMs — they can help with small problems and chain reasoning but miss structure on large sparse graphs and add many false edges.

Key finding

LLMs underperform classical and SOTA causal algorithms on medium and large graphs.

Numbers: At >50 nodes LLM methods often achieve <50% of classical/SOTA performance (reported averages).

Make LLMs argue: multi-model round-table + confidence-weighted voting improves reasoning

0.60

0.70

0.40

4

Combining multiple different LLMs in short, guided discussions yields consistent accuracy lifts on many reasoning tasks; this can improve product QA, decision support, and complex extraction when accuracy matters more than per-request cost.

Key finding

RECONCILE boosts team accuracy on Date Understanding by a large margin versus a leading multi-agent debate baseline.

Numbers: 75.3 → 86.7 (+11.4pp)
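
Confidence-weighted voting, the aggregation step named in the title, can be sketched as follows. This is an illustrative simplification: it sums each model's self-reported confidence as-is, which is an assumption of the example and not necessarily how the paper weights or calibrates votes.

```python
from collections import defaultdict
from typing import List, Tuple


def weighted_vote(votes: List[Tuple[str, float]]) -> str:
    """Pick the answer with the largest total confidence.

    votes: (answer, confidence in [0, 1]) pairs, one per model.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Two moderately confident models outvote one confident model...
print(weighted_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)]))    # B
# ...but a single very confident model can beat two unsure ones.
print(weighted_vote([("A", 0.95), ("B", 0.4), ("B", 0.4)]))   # A
```

Unlike a plain majority vote, this lets a well-supported minority answer win, which only helps if the reported confidences are reasonably calibrated.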