Use simple logic checks to make zero‑shot chain-of-thought answers more reliable

September 23, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.3

Citation Count

11

Authors

Xufeng Zhao, Mengdi Li, Wenhao Lu, Cornelius Weber, Jae Hee Lee, Kun Chu, Stefan Wermter

Why It Matters For Business

LoT is a low‑effort prompting add‑on that raises reasoning accuracy on strong LLMs; use it when correctness matters and you can afford extra API calls.

Summary TLDR

LoT (Logical Thoughts) is a prompting framework that adds a think‑verify‑revise loop based on simple logic (reductio ad absurdum). For each chain-of-thought step, the model generates opposing post‑hoc explanations, then a discriminator picks which side to keep; a failing step is revised and the rest of the chain is re‑generated from it. On modern models (GPT-3.5, GPT-4), LoT gives small but consistent accuracy gains (roughly 1 to 4 points in the results reported here) on math, commonsense, causal, symbolic, and social tasks. Gains are larger and more dependable on stronger LLMs; small models sometimes get worse.
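The think‑verify‑revise loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `llm` is a hypothetical callable that takes a prompt and returns text, the prompt wording is invented, and the function names (`generate_chain`, `verify_step`, `lot_chain`) and the `max_revisions` cap are assumptions. The stub `fake_llm` exists only so the sketch runs without an API.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, model text out


def generate_chain(llm: LLM, question: str, prior: List[str]) -> List[str]:
    """Ask the model to continue the reasoning; one step per non-empty line."""
    context = "\n".join(prior)
    text = llm(f"Q: {question}\n{context}\nContinue the reasoning, one step per line.")
    return [line.strip() for line in text.splitlines() if line.strip()]


def verify_step(llm: LLM, question: str, prior: List[str], step: str) -> bool:
    """Return True if the supporting review beats the opposing one."""
    head = f"Q: {question}\n" + "\n".join(prior) + f"\nStep: {step}\n"
    # Two post-hoc explanations: one arguing the step holds, one arguing it fails.
    support = llm(head + "Explain why this step is true.")
    oppose = llm(head + "Explain why this step is false.")
    # The model then acts as discriminator between the two reviews.
    verdict = llm(
        head + f"Review A: {support}\nReview B: {oppose}\n"
        "Which review is more convincing? Answer A or B."
    )
    return verdict.strip().upper().startswith("A")


def lot_chain(llm: LLM, question: str, max_revisions: int = 3) -> List[str]:
    """Verify each step in order; revise a failing step and re-generate the tail."""
    steps = generate_chain(llm, question, [])
    revisions, i = 0, 0
    while i < len(steps):
        # Once the revision budget is spent, remaining steps are accepted unchecked.
        if revisions < max_revisions and not verify_step(llm, question, steps[:i], steps[i]):
            head = f"Q: {question}\n" + "\n".join(steps[:i]) + f"\nStep: {steps[i]}\n"
            revised = llm(head + "Rewrite the step so that it is correct.").strip()
            kept = steps[:i] + [revised]
            # Everything after the revised step is regenerated from the new prefix.
            steps = kept + generate_chain(llm, question, kept)
            revisions += 1
        i += 1  # simplification: the revised step itself is not re-verified
    return steps


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if "Answer A or B" in prompt:
        return "B" if "Step: 2 + 2 = 5" in prompt else "A"
    if "Rewrite the step" in prompt:
        return "2 + 2 = 4"
    if "Continue the reasoning" in prompt:
        if "2 + 2 = 4" in prompt:
            return "So the answer is 4"
        return "2 + 2 = 5\nSo the answer is 5"
    return "some explanation"


result = lot_chain(fake_llm, "What is 2 + 2?")
print(result)
```

With a real model, each verified step costs three calls in this sketch (two reviews and one verdict), which is where the extra API cost comes from.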

Problem Statement

Chain-of-thought helps LLMs reason, but steps can be logically unsound and errors propagate. The paper asks: can we automatically verify each step with logical checks and revise only the steps that fail, improving zero‑shot CoT without handcrafted examples?

Main Contribution

LoT: a zero-shot prompting loop that thinks, generates post‑hoc opposing explanations, selects the better view, and revises failing steps

Two variants: Cmps‑LoT (compose negation) and Adpt‑LoT (generate both T and ¬T explanations and let the model choose)

Comprehensive zero‑shot experiments across tasks and models showing modest but consistent benefits on larger LLMs

A practical ablation showing post‑hoc opposing reviews (Adpt‑LoT) outperform simple self‑check and Cmps‑LoT

Key Findings

Adpt‑LoT improves zero‑shot CoT accuracy on math and reasoning tasks for strong models

Numbers: GSM8K 78.75 → 80.15 (+1.40% abs); AQuA 57.09 → 60.63 (+3.54% abs)

GPT‑4 sees similar gains from LoT on hard tasks

Numbers: GSM8K 94.29 → 95.71 (+1.42% abs); Date 83.09 → 85.16 (+2.07% abs)

LoT increases step‑wise revision activity, especially on stronger models and harder tasks

Numbers: Revision frequency (GPT-3.5): GSM8K 16%, AQuA 28%, Date 32%

Adpt‑LoT outperforms naive self‑check and Cmps‑LoT for error detection

Numbers: On GSM8K (GPT-3.5): CoT 78.75, Self‑Check 76.15, Cmps‑LoT 77.67, LoT 80.15

LoT can worsen outcomes on small models and occasionally hallucinate during verification

Numbers: Worsening rates for smaller models (example): Vicuna‑7b on AQuA worsened by 10.91%
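The "% abs" figures above are differences in percentage points. The short sketch below recomputes them from the reported accuracies and also shows two other views of the same gain: relative gain over the baseline, and the share of baseline errors removed. The model labels on the first two rows are inferred from the GSM8K ablation figures and the GPT‑4 rows from the finding that introduces them; the `gains` helper and its field names are illustrative assumptions.

```python
# (baseline CoT accuracy %, LoT accuracy %) copied from the findings above.
reported = {
    ("GPT-3.5", "GSM8K"): (78.75, 80.15),
    ("GPT-3.5", "AQuA"): (57.09, 60.63),
    ("GPT-4", "GSM8K"): (94.29, 95.71),
    ("GPT-4", "Date"): (83.09, 85.16),
}


def gains(baseline: float, lot: float) -> dict:
    """Express one accuracy change three ways (all rounded to 2 decimals)."""
    abs_gain = lot - baseline
    return {
        # Difference in percentage points (the "% abs" number).
        "abs_points": round(abs_gain, 2),
        # Gain relative to the baseline accuracy.
        "rel_gain_pct": round(100 * abs_gain / baseline, 2),
        # Fraction of the baseline's errors that disappeared.
        "error_reduction_pct": round(100 * abs_gain / (100 - baseline), 2),
    }


for key, (base, lot) in reported.items():
    print(key, gains(base, lot))
```

The error-reduction view is useful when the baseline is already high: a gain of about 1.4 points on a 94% baseline removes a much larger share of the remaining errors than the same gain on a 79% baseline.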

Results

Accuracy (GSM8K, GPT-3.5): 80.15% (baseline CoT: 78.75%)

Accuracy (AQuA, GPT-3.5): 60.63% (baseline CoT: 57.09%)

Accuracy: 52.37% (baseline CoT: 51.26%)

Accuracy (GSM8K, GPT-4): 95.71% (baseline CoT: 94.29%)

revision_frequency (GSM8K, GPT-3.5): 16% (baseline: N/A)

worsening_rate: 1.79% (baseline: N/A)

Who Should Care

What To Try In 7 Days

Run Adpt‑LoT on a representative set of your prompts with your target LLM and compare accuracy.

Measure added API cost from extra verification calls and decide a cost/quality threshold.

If using a small open model, test LoT cautiously, since it can harm results for weak models.
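One way to run the comparison suggested above is a small paired evaluation over your own prompts. The sketch below is an assumption-laden outline: `Solver` is a hypothetical wrapper around your baseline or LoT pipeline that returns a final answer string, exact string match stands in for whatever answer checking your task needs, and the toy dataset and solvers are placeholders.

```python
from typing import Callable, Dict, List, Tuple

Solver = Callable[[str], str]  # hypothetical: question in, final answer out


def accuracy(solver: Solver, dataset: List[Tuple[str, str]]) -> float:
    """Fraction of (question, gold answer) pairs answered by exact match."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    hits = sum(solver(question).strip() == gold.strip() for question, gold in dataset)
    return hits / len(dataset)


def compare(baseline: Solver, candidate: Solver,
            dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score both solvers on the same items and report the gap in points."""
    base = accuracy(baseline, dataset)
    cand = accuracy(candidate, dataset)
    return {"baseline": base, "candidate": cand, "delta_points": 100 * (cand - base)}


# Toy stand-ins; replace with calls into your own CoT and LoT pipelines.
toy_data = [("1+1", "2"), ("2+2", "4"), ("3+3", "6"), ("4+4", "8")]
cot_answers = {"1+1": "2", "2+2": "5", "3+3": "6", "4+4": "8"}
lot_answers = {"1+1": "2", "2+2": "4", "3+3": "6", "4+4": "8"}

report = compare(cot_answers.__getitem__, lot_answers.__getitem__, toy_data)
print(report)
```

Scoring both pipelines on the same items keeps the comparison paired, which matters when the expected gain is only a point or two.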

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Works best with large, capable LLMs; small models can degrade
  • Increases API calls and latency because each step may generate multiple reviews
  • Does not address model bias or grounding to external facts
  • Experiments are zero‑shot only; few‑shot or fine‑tuning effects unexplored
  • Verification sometimes relies on model‑generated reviews that can hallucinate
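To get a feel for the extra API calls mentioned above, a back-of-envelope estimate can help before running anything. The sketch below is not taken from the paper: it assumes one call for the initial chain, three calls per verified step (two reviews plus one verdict), and two calls per revision (one rewrite plus one re-generation of the tail), and it ignores the cost of verifying regenerated steps. All parameter names and defaults are hypothetical and should be replaced with values measured on your own workload.

```python
def estimated_calls(num_steps: int, revision_rate: float,
                    calls_per_check: int = 3, calls_per_revision: int = 2) -> float:
    """Rough expected number of model calls for one problem.

    num_steps: reasoning steps in the initial chain.
    revision_rate: assumed fraction of steps that fail verification (0 to 1).
    """
    if num_steps < 0 or not 0.0 <= revision_rate <= 1.0:
        raise ValueError("num_steps must be >= 0 and revision_rate in [0, 1]")
    initial_chain = 1  # plain zero-shot CoT needs only this call
    checks = num_steps * calls_per_check
    revisions = num_steps * revision_rate * calls_per_revision
    return initial_chain + checks + revisions


# Example: a 4-step chain with a quarter of the steps revised.
print(estimated_calls(4, 0.25))
```

Under these assumptions the verification calls dominate: even with no revisions at all, a 4-step chain needs 13 calls instead of 1.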

When Not To Use

  • On small or weak LLMs without validation
  • When low latency or minimal API calls are required
  • When you require grounding to external facts rather than internal logical checks

Failure Modes

  • Post‑hoc reviews can hallucinate and then mislead the discriminator (false corrections)
  • Small models may fail to follow the verification instructions and get worse
  • Extra revisions increase cost and latency without changing the final answer for some problems

Core Entities

Models

  • Vicuna-7b
  • Vicuna-13b
  • Vicuna-33b
  • GPT-3.5-turbo
  • GPT-4

Metrics

  • Accuracy
  • revision_frequency
  • reasoning_step_count
  • worsening_rate
  • improvement_rate
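The summary does not define `improvement_rate` and `worsening_rate`. One plausible reading, used here purely as an assumption, is the fraction of problems whose outcome flips between baseline CoT and LoT: wrong to right for improvement, right to wrong for worsening. The paper may normalize differently (for example, over revised problems only), so treat this as a sketch of the idea rather than the reported metric.

```python
from typing import Dict, Sequence


def transition_rates(baseline_ok: Sequence[bool], lot_ok: Sequence[bool]) -> Dict[str, float]:
    """Share of problems that flip between baseline and LoT (assumed definition)."""
    if len(baseline_ok) != len(lot_ok) or len(baseline_ok) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(baseline_ok)
    # Wrong under the baseline, right under LoT.
    improved = sum(1 for b, l in zip(baseline_ok, lot_ok) if not b and l)
    # Right under the baseline, wrong under LoT.
    worsened = sum(1 for b, l in zip(baseline_ok, lot_ok) if b and not l)
    return {"improvement_rate": improved / n, "worsening_rate": worsened / n}


# Toy per-problem correctness flags for the same four problems.
rates = transition_rates([True, False, True, False], [True, True, False, False])
print(rates)
```

Tracking both rates separately shows whether a flat accuracy number hides offsetting fixes and regressions, which is the pattern to watch for on weaker models.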

Datasets

  • GSM8K
  • AQuA
  • DateUnderstanding
  • OddOneOut
  • CauseEffect
  • ShuffledObjects
  • LastLetter
  • SocialQA

Benchmarks

  • Accuracy