Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.3
Citation Count
11
Why It Matters For Business
LoT is a low‑effort prompting add‑on that raises reasoning accuracy on strong LLMs; use it when correctness matters and you can afford extra API calls.
Summary TLDR
LoT (Logical Thoughts) is a prompting framework that adds a think-verify-revise loop based on simple logic (reductio ad absurdum). For each chain-of-thought step, the model generates opposing post-hoc explanations and a discriminator picks which side to keep; failing steps are revised and the chain is regenerated. On modern models (GPT-3.5, GPT-4), LoT gives small but consistent accuracy gains on math, commonsense, causal, symbolic, and social tasks. Gains are larger and more reliable on stronger LLMs; small models sometimes get worse.
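The think-verify-revise loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `LLM` callable interface, the prompt wording, the line-per-step splitting, and the `max_revisions` cap are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def split_steps(chain: str) -> List[str]:
    """Split a chain of thought into non-empty lines, one step per line."""
    return [line.strip() for line in chain.splitlines() if line.strip()]


def lot_loop(question: str, llm: LLM, max_revisions: int = 3) -> List[str]:
    """Think, verify each step with opposing reviews, revise failing steps."""
    steps = split_steps(llm(f"Q: {question}\nLet's think step by step."))
    verified: List[str] = []
    revisions = 0
    i = 0
    while i < len(steps):
        step = steps[i]
        context = "\n".join(verified)
        # Post-hoc explanations for both sides: the step holds / the step fails.
        support = llm(f"Context:\n{context}\nExplain why this step is true: {step}")
        oppose = llm(f"Context:\n{context}\nExplain why this step is false: {step}")
        # Discriminator call: pick which review to keep.
        verdict = llm(
            f"Step: {step}\nReview A (true): {support}\n"
            f"Review B (false): {oppose}\n"
            "Which review is more plausible? Answer A or B."
        )
        if verdict.strip().upper().startswith("B") and revisions < max_revisions:
            revised = llm(
                f"Context:\n{context}\nRevise this step given: {oppose}\nStep: {step}"
            )
            verified.append(revised.strip())
            revisions += 1
            # Regenerate the remainder of the chain from the revised prefix.
            prefix = "\n".join(verified)
            rest = llm(f"Q: {question}\n{prefix}\nContinue step by step.")
            steps = verified + split_steps(rest)
        else:
            verified.append(step)
        i += 1
    return verified
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; in practice the prompts and the step-splitting rule would need tuning for the target model.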
Problem Statement
Chain-of-thought helps LLMs reason, but steps can be logically unsound and errors propagate. The paper asks: can we automatically verify each step with logical checks and revise only the steps that fail, improving zero‑shot CoT without handcrafted examples?
Main Contribution
- LoT: a zero-shot prompting loop that thinks, generates post-hoc opposing explanations, selects the better view, and revises failing steps
- Two variants: Cmps-LoT (compose negation) and Adpt-LoT (generate explanations for both T and ¬T and let the model choose)
- Comprehensive zero-shot experiments across tasks and models showing modest but consistent benefits on larger LLMs
- A practical ablation showing that post-hoc opposing reviews (Adpt-LoT) outperform simple self-check and Cmps-LoT
Key Findings
- Adpt-LoT improves zero-shot CoT accuracy on math and reasoning tasks for strong models
- GPT-4 sees similar improvements from LoT on hard tasks
- LoT increases step-wise revision activity, especially on stronger models and harder tasks
- Adpt-LoT outperforms naive self-check and Cmps-LoT for error detection
- LoT can worsen outcomes on small models and occasionally hallucinates during verification
Results
Accuracy
revision_frequency
worsening_rate
Who Should Care
What To Try In 7 Days
- Run Adpt-LoT on a representative set of your prompts with your target LLM and compare accuracy.
- Measure added API cost from extra verification calls and decide a cost/quality threshold.
- If using a small open model, test LoT cautiously; it can harm results for weak models.
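For the comparison step, a small helper like the one below may be enough to start. It is a sketch under stated assumptions: answers are compared by exact string match, and `improvement_rate` and `worsening_rate` are defined here as the share of items that flip from wrong to right and from right to wrong. These definitions are assumed for illustration and may differ from the paper's.

```python
from typing import Dict, List


def compare_runs(gold: List[str], cot: List[str], lot: List[str]) -> Dict[str, float]:
    """Compare baseline CoT answers with LoT answers against gold labels."""
    if not gold or not (len(gold) == len(cot) == len(lot)):
        raise ValueError("gold, cot and lot must be non-empty and equally long")
    n = len(gold)
    # Exact-match correctness per item (assumed scoring rule).
    cot_ok = [c == g for c, g in zip(cot, gold)]
    lot_ok = [l == g for l, g in zip(lot, gold)]
    return {
        "cot_accuracy": sum(cot_ok) / n,
        "lot_accuracy": sum(lot_ok) / n,
        # Wrong under CoT, right under LoT.
        "improvement_rate": sum((not c) and l for c, l in zip(cot_ok, lot_ok)) / n,
        # Right under CoT, wrong under LoT.
        "worsening_rate": sum(c and (not l) for c, l in zip(cot_ok, lot_ok)) / n,
    }
```

Tracking the two flip rates separately matters because equal accuracy can hide cases where LoT fixes some answers while breaking others, which is the pattern to watch for on weaker models.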
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Works best with large, capable LLMs; small models can degrade
- Increases API calls and latency because each step may generate multiple reviews
- Does not address model bias or grounding to external facts
- Experiments are zero‑shot only; few‑shot or fine‑tuning effects unexplored
- Verification sometimes relies on model‑generated reviews that can hallucinate
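To get a feel for the extra API calls, a back-of-the-envelope estimate can help. The per-step breakdown below (two opposing reviews plus one discriminator call per verified step, and one revise call plus one regeneration call per revision) is an assumption about one plausible implementation, not a figure reported in the paper.

```python
def estimate_lot_calls(steps: int, revisions: int) -> int:
    """Rough count of LLM calls for one question under the assumed breakdown."""
    if steps < 0 or revisions < 0 or revisions > steps:
        raise ValueError("need 0 <= revisions <= steps")
    initial_chain = 1   # the first zero-shot CoT generation
    per_step = 3        # assumed: two opposing reviews + one discriminator call
    per_revision = 2    # assumed: one revise call + one regeneration call
    return initial_chain + per_step * steps + per_revision * revisions
```

Under these assumptions, a five-step chain with one revision costs 1 + 15 + 2 = 18 calls, compared with a single call for plain zero-shot CoT.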
When Not To Use
- On small or weak LLMs without validation
- When low latency or minimal API calls are required
- When you require grounding to external facts rather than internal logical checks
Failure Modes
- Post‑hoc reviews can hallucinate and then mislead the discriminator (false corrections)
- Small models may fail to follow the verification instructions and get worse
- Extra revisions increase cost and latency without changing the final answer for some problems
Core Entities
Models
- Vicuna-7b
- Vicuna-13b
- Vicuna-33b
- GPT-3.5-turbo
- GPT-4
Metrics
- Accuracy
- revision_frequency
- reasoning_step_count
- worsening_rate
- improvement_rate
Datasets
- GSM8K
- AQuA
- DateUnderstanding
- OddOneOut
- CauseEffect
- ShuffledObjects
- LastLetter
- SocialQA
Benchmarks
- Accuracy

