Overview
LoT is ready for practical use as a prompt‑level add‑on: it improves accuracy on strong LLMs, but it increases API calls, can hurt small models, and can mislead when post‑hoc reviews hallucinate.
Citations: 11
Evidence Strength: 0.70
Confidence: 0.82
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 30%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
LoT is a low‑effort prompting add‑on that raises reasoning accuracy on strong LLMs; use it when correctness matters and you can afford extra API calls.
Summary TLDR
LoT (Logical Thoughts) is a prompting framework that adds a think‑verify‑revise loop based on simple logic (reductio ad absurdum). For each chain-of-thought step, the model generates opposing post‑hoc explanations, and a discriminator picks which side to keep; failing steps are revised and the rest of the chain is re‑generated. On modern models (GPT-3.5, GPT-4), LoT gives small but consistent accuracy gains on math, commonsense, causal, symbolic, and social tasks. Gains are larger and safer on stronger LLMs; small models sometimes get worse.
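The think‑verify‑revise loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `LLM` type, the function names `verify_step` and `lot_chain`, the prompt wording, and the stop condition (a step containing the word "answer") are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def verify_step(llm: LLM, context: List[str], step: str) -> bool:
    """Return True if the step survives the opposing-explanations check."""
    so_far = "\n".join(context)
    # Generate one post-hoc review arguing the step is true, one arguing it is false.
    support = llm(f"{so_far}\nExplain why this step is TRUE: {step}")
    oppose = llm(f"{so_far}\nExplain why this step is FALSE: {step}")
    # Ask the model to act as discriminator between the two reviews.
    verdict = llm(
        f"{so_far}\nStep: {step}\n"
        f"Review A (true): {support}\nReview B (false): {oppose}\n"
        "Which review is more plausible? Answer A or B."
    )
    return verdict.strip().upper().startswith("A")


def lot_chain(llm: LLM, question: str, max_steps: int = 8) -> List[str]:
    """Build a chain step by step, revising any step that fails verification."""
    chain: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(chain) + "\nNext step:")
        if not verify_step(llm, chain, step):
            # Replace the failing step; later steps are generated from the revision.
            step = llm("\n".join(chain) + f"\nThe step '{step}' is flawed. Revised step:")
        chain.append(step)
        if "answer" in step.lower():  # assumed stop condition for this sketch
            break
    return chain[1:]
```

In this sketch each verified step costs three extra model calls, plus one more when a revision is needed, which is where the added API cost comes from.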
Problem Statement
Chain-of-thought helps LLMs reason, but steps can be logically unsound and errors propagate. The paper asks: can we automatically verify each step with logical checks and revise only the steps that fail, improving zero‑shot CoT without handcrafted examples?
Main Contribution
LoT: a zero-shot prompting loop that thinks, generates post‑hoc opposing explanations, selects the better view, and revises failing steps
Two variants: Cmps‑LoT (compose the negation of each step) and Adpt‑LoT (generate explanations for both T and ¬T and let the model choose)
Key Findings
Adpt‑LoT improves zero‑shot CoT accuracy on math and reasoning tasks for strong models
GPT‑4 shows similar improvements with LoT on hard tasks
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80.15% | CoT 78.75% | +1.40% abs | GSM8K (GPT-3.5-turbo) | Adpt-LoT improves math accuracy | Table 4 |
| Accuracy | 60.63% | CoT 57.09% | +3.54% abs | AQuA (GPT-3.5-turbo) | Adpt-LoT improves multi-choice math accuracy | Table 4 |
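The Delta column reports absolute gains in percentage points. As a quick check of the arithmetic, and to show the corresponding relative gains, the values can be recomputed from the table. The helper name `deltas` is hypothetical; the relative figures are derived here and are not reported in the source.

```python
from typing import Tuple


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


# Accuracy figures taken from the Results table (GPT-3.5-turbo, Adpt-LoT vs CoT).
print("GSM8K:", deltas(80.15, 78.75))  # GSM8K: (1.4, 1.78)
print("AQuA:", deltas(60.63, 57.09))   # AQuA: (3.54, 6.2)
```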
What To Try In 7 Days
Run Adpt‑LoT on a representative set of your prompts with your target LLM and compare accuracy.
Measure added API cost from extra verification calls and decide a cost/quality threshold.
If using a small open model, test LoT cautiously — it can harm results for weak models.
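One way to run the comparison above is a small harness that reports accuracy and model calls per item for any solver. This is a sketch under stated assumptions: `CountingLLM`, `evaluate`, and the solver signature (a function taking a model callable and a question, and returning an answer string) are hypothetical names, and exact string match is a stand-in for whatever scoring rule fits your task.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]
Solver = Callable[[LLM, str], str]  # hypothetical: (model, question) -> answer


class CountingLLM:
    """Wraps any prompt -> completion callable and counts invocations."""

    def __init__(self, llm: LLM) -> None:
        self._llm = llm
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self._llm(prompt)


def evaluate(solver: Solver, llm: LLM, dataset: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score a solver on (question, gold answer) pairs and track call volume."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    counted = CountingLLM(llm)
    # Exact match after stripping whitespace; swap in a task-specific scorer as needed.
    correct = sum(solver(counted, question).strip() == gold for question, gold in dataset)
    n = len(dataset)
    return {"accuracy": correct / n, "calls_per_item": counted.calls / n}
```

Running `evaluate` once with a plain CoT solver and once with a LoT solver on the same dataset gives the two numbers needed for a cost/quality decision: the accuracy difference and the ratio of calls per item.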
Risks & Boundaries
Limitations
Works best with large, capable LLMs; small models can degrade
Increases API calls and latency because each step may generate multiple reviews
When Not To Use
On small or weak LLMs without validation
When low latency or minimal API calls are required
Failure Modes
Post‑hoc reviews can hallucinate and then mislead the discriminator (false corrections)
Small models may fail to follow the verification instructions and get worse

