Have LLMs 'think about their thinking' to boost understanding on NLU tasks

August 10, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is simple and replicable, and the evaluation covers multiple models and 10 datasets. However, results rely on API calls and manual prompt design, so real‑world tuning is needed.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 70%

Authors

Yuqing Wang, Yun Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Metacognitive Prompting is a low‑cost way to improve model understanding on domain text (law, medicine) without retraining; expect modest average gains and larger wins on specialized datasets.

Who Should Care

Summary TLDR

The authors introduce Metacognitive Prompting (MP): a five-step prompt template that asks a model to (1) understand input, (2) give a preliminary judgment, (3) critically reassess that judgment, (4) provide a final answer with reasoning, and (5) report confidence. Evaluated on 10 NLU datasets (general, biomedical, legal) and four LLMs (Llama‑2‑13b, PaLM‑bison, GPT‑3.5, GPT‑4), MP consistently improves accuracy/F1 over Chain‑of‑Thought (CoT) baselines. Gains are largest on domain tasks (law, medicine). Key caveats: MP can cause 'overthinking' and 'overcorrection', requires manual prompt design, and verbalized confidence is imperfect.
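The five stages listed above can be sketched as a small prompt builder. This is a minimal illustration, not the authors' template: the stage wording, the `MP_STAGES` list, and the `build_mp_prompt` helper are all assumptions made for this example.

```python
# Hypothetical stage wording: the five stages follow the summary above,
# but the exact phrasing is an assumption, not the authors' template.
MP_STAGES = [
    "Clarify your understanding of the input.",
    "Make a preliminary judgment.",
    "Critically reassess that preliminary judgment.",
    "Give your final answer and explain your reasoning.",
    "State your confidence in the final answer (0-100%).",
]


def build_mp_prompt(task_instruction: str, text: str) -> str:
    """Assemble a zero-shot prompt that walks through the five MP stages."""
    steps = "\n".join(
        f"{i}. {stage}" for i, stage in enumerate(MP_STAGES, start=1)
    )
    return (
        f"{task_instruction}\n\n"
        f"Input: {text}\n\n"
        "Work through the following steps in order:\n"
        f"{steps}\n"
    )


# Illustrative call with a made-up paraphrase-detection input.
prompt = build_mp_prompt(
    "Decide whether the two questions are paraphrases (yes/no).",
    "Q1: How do I learn Python? Q2: What is the best way to learn Python?",
)
print(prompt)
```

Keeping the stages in a list makes it easy to reword a single stage for a domain without touching the rest of the prompt.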

Problem Statement

Current prompts (e.g., Chain‑of‑Thought) help stepwise reasoning but do not reliably deepen model understanding of meaning and context. The paper asks: can a human‑inspired introspective prompting flow improve natural language understanding across general and domain datasets?

Main Contribution

Introduce Metacognitive Prompting (MP): a five‑stage prompt that imitates human self‑reflection to improve understanding.

Large empirical study across 10 NLU datasets and four LLMs showing MP outperforms CoT and variants in zero‑ and few‑shot settings.

Key Findings

MP gives a consistent aggregate performance uplift over CoT in zero‑shot settings.

Numbers: Relative boost 4.8%–6.4% vs CoT (zero‑shot, averaged across models)

Practical Use: Swap 'Let's think step by step' for an MP template to gain ~5% relative performance on average without extra training.

Evidence Ref: Section 5.2, Fig. 3

MP helps most on domain NLU tasks (legal and biomedical).

Numbers: EUR‑LEX µ‑F1 29.9 → 35.6 (+5.7 abs); MedNLI acc. +4.3% over PS (Plan‑and‑Solve prompting); UNFAIR‑ToS µ‑F1 +9.6% over PS

Practical Use: Use MP when handling legal or medical text to get noticeably better label accuracy than standard CoT prompts.

Evidence Ref: Table 3; Section 5.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Avg. relative improvement (zero-shot) vs CoT | 4.8%–6.4% | CoT (zero-shot) | relative +4.8%–6.4% | average over 10 NLU datasets | Section 5.2, Fig. 3 and text reporting aggregated gains | Table 3, Section 5.2
EUR-LEX µ‑F1 (zero-shot) | MP 35.6 vs CoT 29.9 | CoT | +5.7 µ‑F1 (absolute) | EUR-LEX | Table 3 shows µ‑F1 29.9 (CoT) → 35.6 (MP) | Table 3
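The two results mix delta types: the aggregate gain is relative, while the EUR-LEX gain is absolute. A short worked calculation puts the EUR-LEX numbers on both scales. The helper names are illustrative; the only inputs are the 29.9 and 35.6 µ‑F1 values reported above.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in metric points (e.g. F1 points)."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Difference as a fraction of the baseline score."""
    return (new - baseline) / baseline


# EUR-LEX micro-F1 values as reported in the results above.
cot_f1, mp_f1 = 29.9, 35.6

abs_delta = absolute_gain(cot_f1, mp_f1)
rel_delta = relative_gain(cot_f1, mp_f1)
print(f"absolute: {abs_delta:+.1f} points, relative: {rel_delta:+.1%}")
# prints "absolute: +5.7 points, relative: +19.1%"
```

On the relative scale, the EUR-LEX gain (about 19%) is well above the 4.8%–6.4% cross-dataset average, which is consistent with the claim that domain datasets benefit most.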

What To Try In 7 Days

Run zero‑shot MP on a held‑out sample of your task and compare to your current prompt.

Use MP on domain examples (contracts, clinical notes) first — highest gains shown there.

Log verbalized confidence and track high‑confidence errors to calibrate thresholds before deployment.
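The steps above can be sketched as a small evaluation harness: score the current prompt and the MP prompt on the same held-out sample, then list MP errors that came with high stated confidence. Everything here is an assumption for illustration: the `Prediction` record, the 0.9 threshold, and the toy data are made up, and the model calls that would fill in the labels are left out.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    gold: str             # reference label
    baseline_label: str   # label from your current prompt
    mp_label: str         # final label from the MP prompt
    mp_confidence: float  # verbalized confidence, parsed to 0.0-1.0


def accuracy(pairs) -> float:
    """Fraction of (predicted, gold) pairs that match."""
    pairs = list(pairs)
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def compare(preds, threshold: float = 0.9) -> dict:
    """Compare both prompts and collect high-confidence MP errors."""
    high_conf_errors = [
        p for p in preds
        if p.mp_confidence >= threshold and p.mp_label != p.gold
    ]
    return {
        "baseline_acc": accuracy((p.baseline_label, p.gold) for p in preds),
        "mp_acc": accuracy((p.mp_label, p.gold) for p in preds),
        "high_conf_errors": high_conf_errors,
    }


# Toy, made-up predictions purely to exercise the code.
sample = [
    Prediction("yes", "yes", "yes", 0.95),
    Prediction("no", "yes", "no", 0.80),
    Prediction("yes", "no", "no", 0.92),
    Prediction("no", "no", "no", 0.70),
]
report = compare(sample)
print(report["baseline_acc"], report["mp_acc"], len(report["high_conf_errors"]))
```

Reviewing the `high_conf_errors` list by hand is one way to decide whether a confidence threshold is safe to use for routing or auto-acceptance.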

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Designing MP templates needs manual effort and domain tuning.

Evaluation uses 600 random validation examples per dataset due to API limits; broader sampling may change results.
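To see why a 600-example sample leaves room for results to shift, the sketch below draws a reproducible random subset and estimates the margin of error on an accuracy measured at that size. The seed, the 80% accuracy value, and the helper names are illustrative assumptions, not details from the paper; the margin uses the standard normal approximation for a proportion.

```python
import math
import random


def sample_validation(examples, n: int = 600, seed: int = 0):
    """Draw a reproducible random subset of at most n examples."""
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))


def accuracy_margin(acc: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for accuracy on n examples."""
    return z * math.sqrt(acc * (1 - acc) / n)


# Stand-in for a validation set: 5,000 example ids.
subset = sample_validation(list(range(5000)))

# Illustrative accuracy of 80%, not a number from the paper.
margin = accuracy_margin(0.80, 600)
print(f"{len(subset)} examples, accuracy 80% +/- {margin:.1%}")
# prints "600 examples, accuracy 80% +/- 3.2%"
```

A margin of roughly three points at this sample size means small differences between prompts on a single dataset should be confirmed on a larger sample before acting on them.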

When Not To Use

For trivial classification tasks where extra steps cause overthinking and hurt accuracy.

In low‑latency or very low‑cost settings since MP increases prompt length and API tokens.

Failure Modes

Overthinking: MP over‑complicates simple inputs and flips correct initial answers.

Overcorrection: MP abandons a correct initial judgment and moves to an incorrect final answer.
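Both failure modes show up as a correct preliminary judgment turning into a wrong final answer, so a first diagnostic is to count those flips. The sketch below assumes you have already parsed the preliminary and final labels out of each MP response; the category names and toy records are made up for illustration. It does not by itself separate overthinking from overcorrection, which still requires reading the flagged responses.

```python
from collections import Counter


def classify_transition(preliminary: str, final: str, gold: str) -> str:
    """Label how the answer changed between MP stages 2 and 4."""
    if preliminary == gold and final != gold:
        return "correct_to_wrong"   # candidate overthinking/overcorrection
    if preliminary != gold and final == gold:
        return "wrong_to_correct"   # reassessment helped
    if preliminary == final:
        return "unchanged_correct" if final == gold else "unchanged_wrong"
    return "wrong_to_wrong"         # changed, but still incorrect


# Toy (preliminary, final, gold) triples, made up for illustration.
records = [
    ("yes", "yes", "yes"),
    ("yes", "no", "yes"),
    ("no", "yes", "yes"),
    ("no", "no", "yes"),
]
counts = Counter(classify_transition(*r) for r in records)
print(counts)
```

If `correct_to_wrong` outnumbers `wrong_to_correct` on a task, the reassessment stage is likely hurting more than helping there.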

Core Entities

Models

Llama-2-13b-chat, PaLM-bison-chat, GPT-3.5-turbo, GPT-4

Metrics

Accuracy, micro-F1, macro-F1

Datasets

QQP, QNLI, BoolQ, WiC, BC5CDR-chem, DDI, MedNLI, EUR-LEX, LEDGAR, UNFAIR-ToS

Benchmarks

GLUE, SuperGLUE, BLUE, LexGLUE