Survey of practical methods to improve reasoning in large language models

Overview

Decision Snapshot: Needs Validation

The surveyed methods are practical: start with CoT and retrieval for quick gains, and use fine-tuning or RLHF when you need consistent, production-grade reasoning.

Citations: 8

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 0/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Avinash Patil, Aryan Jadon

Why It Matters For Business

Better reasoning reduces wrong conclusions, lowers downstream verification cost, and enables LLMs to be used in higher-stakes workflows like finance, legal, and scientific support.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

This short survey groups recent techniques that improve LLM reasoning into three buckets: prompting tricks (Chain-of-Thought, Self-Consistency, Tree-of-Thought, PAL), architectural changes (retrieval augmentation, neuro-symbolic hybrids, memory modules, GNNs), and learning methods (fine-tuning on reasoning datasets, RLHF, self-supervision). It reviews common benchmarks (GSM8K, MATH, ARC, HotpotQA, LogiQA) and highlights open problems: hallucinations, adversarial fragility, domain transfer, and evaluation gaps. The paper is a practical map for engineers deciding which approach to try depending on task type and compute trade-offs.

Problem Statement

Large language models write fluent text but often fail on multi-step, logical, or formal reasoning. They hallucinate facts, produce inconsistent intermediate steps, and struggle to transfer reasoning across domains. The paper asks: which prompting, model, or training techniques reliably improve reasoning and what are their trade-offs?

Main Contribution

Organizes recent methods to boost LLM reasoning into prompting, architecture, and learning categories.

Summarizes popular benchmarks and metrics used to measure reasoning ability.

Key Findings

Chain-of-Thought prompting helps multi-step problems by making the model emit intermediate steps.

Practical Use: Try CoT prompting first for math, logic, and complex QA to increase transparency and often improve accuracy; it needs careful prompt design and larger models.

Evidence Ref: [11]
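
A minimal sketch of how a few-shot Chain-of-Thought prompt could be assembled and its final answer parsed. The function names, the "Let's think step by step" trigger, and the "The answer is" marker are illustrative assumptions, not a format prescribed by the survey; the model call itself is left out.

```python
import re


def build_cot_prompt(question, examples=()):
    """Assemble a few-shot Chain-of-Thought prompt.

    `examples` is a sequence of (question, reasoning, answer) triples.
    The layout and trigger phrase are assumptions made for illustration.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(
            f"Q: {ex_question}\n"
            f"A: Let's think step by step. {ex_reasoning} "
            f"The answer is {ex_answer}."
        )
    # End on the trigger so the model continues with its own intermediate steps.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


def extract_final_answer(completion):
    """Return the text after the last 'The answer is' marker, or None if absent."""
    matches = re.findall(r"The answer is\s*([^\n]+)", completion)
    if not matches:
        return None
    return matches[-1].strip().rstrip(".")
```

Keeping the answer marker fixed in every worked example makes the final answer easy to parse, which matters when measuring accuracy or voting across several samples.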

Self-Consistency improves CoT results by aggregating multiple reasoning paths and voting for the final answer.

Practical Use: If single CoT runs are unstable, generate many chains and pick the consensus answer to reduce random errors.

Evidence Ref: [12]
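
The voting step can be sketched as below. `sample_fn` is a hypothetical callable, assumed to query the model at a non-zero temperature and return one parsed final answer (or None when parsing fails); it is not part of any specific library.

```python
from collections import Counter


def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample several reasoning chains and majority-vote on the final answer.

    Returns (winning_answer, agreement_ratio). `sample_fn(prompt)` is an
    assumed helper that returns one final answer per call, or None.
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    answers = [a for a in answers if a is not None]  # drop unparseable chains
    if not answers:
        return None, 0.0
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

The agreement ratio is a cheap signal for routing: a low ratio suggests the chains disagree and the answer may deserve external verification.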

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
CoT vs standard prompting	Reported accuracy improvements on arithmetic and logic tasks across studies	Standard prompting	—	GSM8K, math/logic benchmarks	CoT increases multi-step task accuracy relative to vanilla prompting	[11]
Self-Consistency effect	Higher reliability by majority voting across multiple CoT samples	Single CoT chain	—	Math and reasoning tasks (reported across studies)	Self-Consistency aggregates multiple chains to reduce errors	[12]
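
Because the Delta column is empty, a team reproducing these comparisons would need to compute its own. A minimal sketch, assuming exact-match scoring on a shared held-out set; the normalization rule and function names are assumptions, and no values from the survey are implied.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(str(answer).strip().lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def accuracy_delta(baseline_preds, method_preds, references):
    """Accuracy of the method minus accuracy of the baseline on the same examples."""
    return (exact_match_accuracy(method_preds, references)
            - exact_match_accuracy(baseline_preds, references))
```

Scoring both systems on the same examples keeps the delta attributable to the prompting change rather than to a difference in test data.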

What To Try In 7 Days

Implement Chain-of-Thought prompts for multi-step QA and measure accuracy uplift on a held-out set

Add Self-Consistency: generate multiple reasoning chains and vote to stabilize answers

Prototype a retrieval step (dense or BM25) to ground answers for domain-specific queries
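
For the retrieval prototype, a dependency-free BM25 ranker is enough to test whether grounding helps. This is a sketch under stated assumptions: whitespace tokenization, the common defaults k1=1.5 and b=0.75, a non-negative IDF variant, and a non-empty corpus. The class and method names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption; swap in a real one)."""
    return text.lower().split()


class BM25:
    """Small in-memory Okapi BM25 ranker over a non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(docs)
        self.idf = {term: math.log((n - f + 0.5) / (f + 0.5) + 1)
                    for term, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Return the k highest-scoring documents, best first."""
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.docs[i] for i in ranked[:k]]
```

The retrieved passages would then be prepended to the prompt as context before the question; a dense retriever can replace this class later without changing that step.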

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey summarizes prior work; it does not present new experiments or unified benchmarks.

Many claims rely on heterogeneous studies and lack standardized numeric comparisons.

When Not To Use

Do not rely on CoT alone for high-stakes decisions without external verification.

Avoid Tree-of-Thought or large-scale self-consistency sampling when latency or cost budgets are tight.

Failure Modes

Hallucinated intermediate steps that look plausible but are wrong.

Error propagation along CoT chains leading to incorrect final answers.

Core Entities

Models

GPT-4, PaLM, LLaMA, DeepSeek-R1, Toolformer

Metrics

Accuracy, Exact Match, F1, Self-Consistency, Multi-Hop Reasoning Score, Logical Consistency, Brier Score

Datasets

GSM8K, MATH, ARC, HotpotQA, LogiQA, ProofWriter, BIG-Bench, ANLI, HellaSwag, MMLU, HumanEval, SWAG

Benchmarks

GSM8K, MATH, ARC, HotpotQA, LogiQA, ProofWriter, BIG-Bench, ANLI, MMLU
