Use simple logic checks to make zero‑shot chain-of-thought answers more reliable

September 23, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

LoT is practically ready as a prompt‑level add‑on: it improves accuracy on strong LLMs but increases API calls and can hurt small models or mislead when post‑hoc reviews hallucinate.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 40%

Authors

Xufeng Zhao, Mengdi Li, Wenhao Lu, Cornelius Weber, Jae Hee Lee, Kun Chu, Stefan Wermter

Links

Abstract / PDF / Code

Why It Matters For Business

LoT is a low‑effort prompting add‑on that raises reasoning accuracy on strong LLMs; use it when correctness matters and you can afford extra API calls.

Who Should Care

Summary TLDR

LoT (Logical Thoughts) is a prompting framework that adds a think‑verify‑revise loop based on simple logic (reductio ad absurdum). For each chain-of-thought step, the model generates opposing post‑hoc explanations, and a discriminator then picks which side to keep; failing steps are revised and the chain is re‑generated. On modern models (GPT-3.5, GPT-4), LoT gives small but consistent accuracy gains on math, commonsense, causal, symbolic, and social tasks. Gains are larger and safer on stronger LLMs; small models sometimes get worse.

Problem Statement

Chain-of-thought helps LLMs reason, but steps can be logically unsound and errors propagate. The paper asks: can we automatically verify each step with logical checks and revise only the steps that fail, improving zero‑shot CoT without handcrafted examples?

Main Contribution

LoT: a zero-shot prompting loop that thinks, generates post‑hoc opposing explanations, selects the better view, and revises failing steps

Two variants: Cmps‑LoT (compose negation) and Adpt‑LoT (generate both T and ¬T explanations and let the model choose)
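The think-verify-revise loop described above can be sketched roughly as follows. This is a minimal illustration in the spirit of Adpt‑LoT, not the authors' implementation: the `llm` callable, the prompt wording, the step-by-step generation, and the names `verify_step` and `lot_chain` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def verify_step(llm: LLM, question: str, prior: List[str], step: str) -> Tuple[bool, str]:
    """Ask for opposing reviews of `step`, then let the model pick one.

    Returns (step_accepted, refuting_review). Prompt wording is illustrative.
    """
    context = (
        f"Question: {question}\n"
        f"Steps so far: {' '.join(prior)}\n"
        f"Candidate step: {step}\n"
    )
    support = llm(context + "Explain why the candidate step is TRUE.")
    refute = llm(context + "Explain why the candidate step is FALSE.")
    verdict = llm(
        context
        + f"Review A (supports the step): {support}\n"
        + f"Review B (rejects the step): {refute}\n"
        + "Which review is more convincing? Answer A or B."
    )
    return verdict.strip().upper().startswith("A"), refute


def lot_chain(llm: LLM, question: str, max_steps: int = 8, max_revisions: int = 1) -> List[str]:
    """Build a reasoning chain one step at a time, revising rejected steps."""
    steps: List[str] = []
    while len(steps) < max_steps:
        step = llm(
            f"Question: {question}\n"
            f"Steps so far: {' '.join(steps)}\n"
            "Next step (or DONE if finished):"
        ).strip()
        if step.upper() == "DONE":
            break
        for _ in range(max_revisions):
            accepted, critique = verify_step(llm, question, steps, step)
            if accepted:
                break
            # Rejected: ask for a revision that takes the critique into account.
            # A step revised on the last allowed attempt is kept without re-checking.
            step = llm(
                f"Question: {question}\n"
                f"Steps so far: {' '.join(steps)}\n"
                f"Rejected step: {step}\n"
                f"Critique: {critique}\n"
                "Revise the rejected step:"
            ).strip()
        steps.append(step)
    return steps
```

In this sketch each verification costs three extra model calls (two reviews plus a verdict), and each revision costs one more, which is where the added API cost and latency come from. Later steps are always generated after any revision, so they build on the corrected step.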

Key Findings

Adpt‑LoT improves zero‑shot CoT accuracy on math and reasoning tasks for strong models

Numbers: GSM8K: 78.75 → 80.15 (+1.40% abs); AQuA: 57.09 → 60.63 (+3.54% abs)

Practical Use: If you use GPT‑3.5 for math/QA, add the LoT loop to get a small but reliable lift in accuracy on evaluated datasets.

Evidence Ref: Table 4 (GPT-3.5-turbo)

GPT‑4 sees similar gains from LoT on hard tasks

Numbers: GSM8K: 94.29 → 95.71 (+1.42% abs); Date: 83.09 → 85.16 (+2.07% abs)

Practical Use: High‑capability models still benefit; use LoT to squeeze extra correctness when accuracy matters.

Evidence Ref: Table 1 (GPT-4 results)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 80.15% | CoT 78.75% | +1.40% abs | GSM8K (GPT-3.5-turbo) | Adpt-LoT improves math accuracy | Table 4
Accuracy | 60.63% | CoT 57.09% | +3.54% abs | AQuA (GPT-3.5-turbo) | Adpt-LoT improves multi-choice math accuracy | Table 4
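The "abs" deltas in the two rows above are differences in accuracy percentage points. The short sketch below recomputes them from the reported values and also derives the share of baseline errors removed. That second figure is a derived view added here for illustration and is not reported in the source; the function names are hypothetical.

```python
# Reported accuracies (percent) from the rows above: (CoT baseline, Adpt-LoT).
RESULTS = {
    "GSM8K (GPT-3.5-turbo)": (78.75, 80.15),
    "AQuA (GPT-3.5-turbo)": (57.09, 60.63),
}


def absolute_gain(baseline: float, lot: float) -> float:
    """Gain in percentage points, rounded to two decimals."""
    return round(lot - baseline, 2)


def error_reduction(baseline: float, lot: float) -> float:
    """Percent of the baseline's errors that LoT removes (derived, not from the source)."""
    return round(100 * (lot - baseline) / (100 - baseline), 2)


for name, (baseline, lot) in RESULTS.items():
    print(name, absolute_gain(baseline, lot), error_reduction(baseline, lot))
```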

What To Try In 7 Days

Run Adpt‑LoT on a representative set of your prompts with your target LLM and compare accuracy.

Measure added API cost from extra verification calls and decide a cost/quality threshold.

If using a small open model, test LoT cautiously — it can harm results for weak models.
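One way to run the comparison above is a small harness that scores a baseline solver and a LoT solver on the same examples while counting API calls. This is a sketch under assumptions: the `Solver` interface (returning an answer plus the number of calls it used), exact-match scoring, and the `max_extra_calls_per_point` threshold are all hypothetical choices, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Hypothetical interface: question -> (answer, number of API calls used).
Solver = Callable[[str], Tuple[str, int]]


@dataclass
class RunStats:
    correct: int = 0
    total: int = 0
    api_calls: int = 0

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

    @property
    def calls_per_question(self) -> float:
        return self.api_calls / self.total if self.total else 0.0


def evaluate(solver: Solver, examples: Iterable[Tuple[str, str]]) -> RunStats:
    """Score a solver with exact-match accuracy and tally its API calls."""
    stats = RunStats()
    for question, gold in examples:
        answer, calls = solver(question)
        stats.total += 1
        stats.api_calls += calls
        stats.correct += int(answer.strip() == gold.strip())
    return stats


def worth_it(baseline: RunStats, lot: RunStats, max_extra_calls_per_point: float) -> bool:
    """True if LoT gains accuracy at an acceptable extra-call cost per point gained."""
    gain_points = 100 * (lot.accuracy - baseline.accuracy)
    if gain_points <= 0:
        return False  # no gain (or a regression): not worth any extra cost
    extra_calls = lot.calls_per_question - baseline.calls_per_question
    return extra_calls / gain_points <= max_extra_calls_per_point
```

Exact-match scoring is the simplest option; tasks with free-form answers would need a task-specific comparison instead.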

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Works best with large, capable LLMs; small models can degrade

Increases API calls and latency because each step may generate multiple reviews

When Not To Use

On small or weak LLMs without validation

When low latency or minimal API calls are required

Failure Modes

Post‑hoc reviews can hallucinate and then mislead the discriminator (false corrections)

Small models may fail to follow the verification instructions and get worse
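To watch for these failure modes on your own data, it can help to track how often LoT flips a correct baseline answer to a wrong one, and vice versa. The sketch below uses assumed definitions (both rates are fractions of all evaluated examples), which may differ from how the paper defines its worsening and improvement rates. The function name `flip_rates` is hypothetical.

```python
from typing import Dict, Sequence


def flip_rates(baseline_correct: Sequence[bool], lot_correct: Sequence[bool]) -> Dict[str, float]:
    """Compare per-example correctness of baseline CoT and LoT on the same examples.

    Assumed definitions (not taken from the paper):
      worsening_rate   = share of examples correct under CoT but wrong under LoT
      improvement_rate = share of examples wrong under CoT but correct under LoT
    """
    if len(baseline_correct) != len(lot_correct):
        raise ValueError("Both runs must cover the same examples")
    n = len(baseline_correct)
    if n == 0:
        return {"worsening_rate": 0.0, "improvement_rate": 0.0}
    worsened = sum(b and not l for b, l in zip(baseline_correct, lot_correct))
    improved = sum(l and not b for b, l in zip(baseline_correct, lot_correct))
    return {"worsening_rate": worsened / n, "improvement_rate": improved / n}
```

A worsening rate close to or above the improvement rate suggests the verification step is producing false corrections for that model.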

Core Entities

Models

Vicuna-7b, Vicuna-13b, Vicuna-33b, GPT-3.5-turbo, GPT-4

Metrics

Accuracy, revision_frequency, reasoning_step_count, worsening_rate, improvement_rate

Datasets

GSM8K, AQuA, DateUnderstanding, OddOneOut, CauseEffect, ShuffledObjects, LastLetter, SocialQA

Benchmarks

Accuracy