Have LLMs 'think about their thinking' to boost understanding on NLU tasks

August 10, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.3

Citation Count

9

Authors

Yuqing Wang, Yun Zhao

Links

Abstract / PDF

Why It Matters For Business

Metacognitive Prompting is a low‑cost way to improve model understanding on domain text (law, medicine) without retraining; expect modest average gains and larger wins on specialized datasets.

Summary TLDR

The authors introduce Metacognitive Prompting (MP): a five-step prompt template that asks a model to (1) understand input, (2) give a preliminary judgment, (3) critically reassess that judgment, (4) provide a final answer with reasoning, and (5) report confidence. Evaluated on 10 NLU datasets (general, biomedical, legal) and four LLMs (Llama‑2‑13b, PaLM‑bison, GPT‑3.5, GPT‑4), MP consistently improves accuracy/F1 over Chain‑of‑Thought (CoT) baselines. Gains are largest on domain tasks (law, medicine). Key caveats: MP can cause 'overthinking' and 'overcorrection', requires manual prompt design, and verbalized confidence is imperfect.
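The five stages can be sketched as a small prompt builder. This is a minimal illustration, not the authors' prompt text: the stage wording, the `MP_STAGES` list, and the `build_mp_prompt` helper are all assumptions made for this example.

```python
# Hypothetical wording: the five stages follow the summary above,
# but these sentences are NOT the authors' exact prompt text.
MP_STAGES = [
    "Clarify your understanding of the input.",
    "Make a preliminary judgment.",
    "Critically reassess that preliminary judgment.",
    "State your final answer and explain your reasoning.",
    "Report your confidence (0-100%) in the final answer.",
]


def build_mp_prompt(task_instruction: str, input_text: str) -> str:
    """Assemble a zero-shot prompt that walks through the five MP stages."""
    steps = "\n".join(
        f"{i}. {stage}" for i, stage in enumerate(MP_STAGES, start=1)
    )
    return (
        f"{task_instruction}\n\n"
        f"Input: {input_text}\n\n"
        "As you perform this task, follow these steps:\n"
        f"{steps}\n"
    )


if __name__ == "__main__":
    print(build_mp_prompt(
        "Decide whether the two questions are paraphrases (yes/no).",
        "Q1: How do I learn Python? Q2: What is the best way to learn Python?",
    ))
```

The resulting string can be sent to any chat or completion endpoint; a few-shot variant would prepend worked examples that follow the same five stages.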

Problem Statement

Current prompts (e.g., Chain‑of‑Thought) help stepwise reasoning but do not reliably deepen model understanding of meaning and context. The paper asks: can a human‑inspired introspective prompting flow improve natural language understanding across general and domain datasets?

Main Contribution

Introduce Metacognitive Prompting (MP): a five‑stage prompt that imitates human self‑reflection to improve understanding.

Large empirical study across 10 NLU datasets and four LLMs showing MP outperforms CoT and variants in zero‑ and few‑shot settings.

Manual error and confidence analysis that identifies common failure modes and guides future calibration and domain adaptations.

Key Findings

MP gives a consistent aggregate performance uplift over CoT in zero‑shot settings.

Numbers: Relative boost 4.8%–6.4% vs CoT (zero‑shot, averaged across models)

MP helps most on domain NLU tasks (legal and biomedical).

Numbers: EUR‑LEX µ‑F1 29.9 → 35.6 (+5.7 abs); MedNLI acc. +4.3% over PS; UNFAIR‑ToS µ‑F1 +9.6% over PS

MP introduces two major error modes when it fails.

Numbers: Overthinking 68.3% of MP errors; Overcorrection 31.7% of MP errors

Model‑reported confidence under MP correlates imperfectly with correctness.

Numbers: TP 55.6%, FP 32.5%, TN 6.8%, FN 5.1% (averaged)
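To see what the confidence figures imply, the four shares can be combined with simple arithmetic. The sketch below assumes one particular reading that this summary does not spell out: TP means high confidence and correct, FP high confidence and wrong, TN low confidence and wrong, FN low confidence and correct.

```python
# Shares from the summary (percent of all predictions, averaged).
# Assumed reading (not defined in this summary):
#   TP = high confidence & correct,  FP = high confidence & wrong,
#   TN = low confidence & wrong,     FN = low confidence & correct.
tp, fp, tn, fn = 55.6, 32.5, 6.8, 5.1

high_conf_share = tp + fp                 # how often confidence is high
accuracy_when_confident = tp / (tp + fp)  # confident answers that are right
accuracy_when_unsure = fn / (tn + fn)     # unsure answers that are right
overall_accuracy = tp + fn                # correct answers in both groups

print(f"high-confidence share:    {high_conf_share:.1f}%")
print(f"accuracy when confident:  {accuracy_when_confident:.1%}")
print(f"accuracy when unsure:     {accuracy_when_unsure:.1%}")
print(f"overall accuracy:         {overall_accuracy:.1f}%")
```

Under that reading, about 88% of predictions carry high confidence and roughly 63% of those are correct, which is why high confidence alone is a weak filter.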

Results

avg relative improvement (zero-shot) vs CoT

Value: 4.8%–6.4%

Baseline: CoT (zero-shot)

EUR-LEX µ‑F1 (zero-shot)

Value: MP 35.6 vs CoT 29.9

Baseline: CoT
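This summary mixes absolute gains (points) and relative gains (percent of the baseline), so it helps to compute both for the same pair of scores. A minimal sketch using the EUR-LEX numbers above; the `gains` helper is a name introduced here for illustration.

```python
def gains(baseline: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# EUR-LEX micro-F1, zero-shot: CoT 29.9 -> MP 35.6
abs_gain, rel_gain = gains(baseline=29.9, new=35.6)
print(f"absolute: +{abs_gain:.1f} points, relative: +{rel_gain:.1f}%")
# prints: absolute: +5.7 points, relative: +19.1%
```

The same 5.7-point gain is a 19.1% relative gain because the baseline is low, so the two kinds of figures should not be compared directly.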

MedNLI accuracy

Value: MP +4.3% (relative) over PS

Baseline: Plan‑and‑Solve (PS)

MP error type distribution

Value: Overthinking 68.3%, Overcorrection 31.7%

Confidence vs correctness under MP

Value: TP 55.6%, FP 32.5%, TN 6.8%, FN 5.1%

Who Should Care

What To Try In 7 Days

Run zero‑shot MP on a held‑out sample of your task and compare to your current prompt.

Use MP on domain examples (contracts, clinical notes) first, since the reported gains are highest there.

Log verbalized confidence and track high‑confidence errors to calibrate thresholds before deployment.
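The steps above can be sketched as a tiny evaluation harness. Everything here is illustrative: `Prediction`, `Predictor`, `accuracy`, and `high_confidence_errors` are hypothetical names, the predictor is assumed to wrap your own prompt and model call, and the 0.75 threshold is an arbitrary starting point to be tuned.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    label: str
    confidence: float  # verbalized confidence in [0, 1]; 1.0 if none reported


# A predictor maps input text to a Prediction. In practice it would wrap
# a prompt (current prompt or MP) plus a model call; here it stays abstract.
Predictor = Callable[[str], Prediction]
Sample = List[Tuple[str, str]]  # (input text, gold label) pairs


def accuracy(predict: Predictor, sample: Sample) -> float:
    """Fraction of held-out examples the predictor labels correctly."""
    correct = sum(predict(text).label == gold for text, gold in sample)
    return correct / len(sample)


def high_confidence_errors(
    predict: Predictor, sample: Sample, threshold: float = 0.75
) -> List[Tuple[str, str, Prediction]]:
    """Collect wrong predictions whose confidence is at or above threshold."""
    errors = []
    for text, gold in sample:
        pred = predict(text)
        if pred.label != gold and pred.confidence >= threshold:
            errors.append((text, gold, pred))
    return errors
```

Running `accuracy` once with the current prompt and once with an MP prompt on the same held-out sample gives the comparison; reviewing the output of `high_confidence_errors` at a few thresholds shows how much confidence can be trusted before deployment.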

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Designing MP templates needs manual effort and domain tuning.
  • Evaluation uses 600 random validation examples per dataset due to API limits; broader sampling may change results.
  • Verbalized confidence is informative but not fully calibrated; authors recommend hybrid calibration.
  • Ethics, bias, and privacy effects of introspective prompts are not deeply studied.
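The 600-example limit matters because accuracy measured on a sample carries sampling noise. The sketch below draws a reproducible random subset and estimates an approximate 95% margin of error using the normal (Wald) approximation. The helper names and the 0.60 accuracy value are illustrative assumptions, not figures from the paper.

```python
import math
import random


def sample_examples(examples, n=600, seed=0):
    """Draw a reproducible random subset; return everything if fewer than n."""
    if len(examples) <= n:
        return list(examples)
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    return rng.sample(examples, n)


def accuracy_margin(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for accuracy p_hat measured on n items."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Illustrative only: an accuracy of 0.60 measured on 600 examples.
print(f"+/- {accuracy_margin(0.60, 600):.3f}")
# prints: +/- 0.039
```

At this sample size an accuracy near 60% is uncertain by about four points either way, so small per-dataset differences between prompts may not be distinguishable from noise.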

When Not To Use

  • For trivial classification tasks where extra steps cause overthinking and hurt accuracy.
  • In low‑latency or very low‑cost settings since MP increases prompt length and API tokens.
  • When strict, calibrated uncertainty estimates are required without added calibration steps.

Failure Modes

  • Overthinking: MP over‑complicates simple inputs and reasons its way to an incorrect answer.
  • Overcorrection: MP abandons a correct initial judgment and moves to an incorrect final answer.
  • Domain term misinterpretation in biomedical datasets (terminological misalignment).
  • Statutory interpretation errors in legal tasks where nuance leads to wrong labels.
  • High‑confidence false positives: confident but incorrect predictions remain common.
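A rough way to surface candidate overcorrection cases is to compare the preliminary and final answers against the gold label. This is only a triage heuristic under the assumption that both answers can be parsed from the model output; it is not the authors' method, which relied on manual analysis, and `triage_error` is a hypothetical helper.

```python
def triage_error(preliminary: str, final: str, gold: str) -> str:
    """Bucket one MP response for manual review.

    Assumes the preliminary and final answers were already parsed from
    the model output and normalized to the same label vocabulary.
    """
    if final == gold:
        return "correct"
    if preliminary == gold:
        # Started right, ended wrong: worth checking for overcorrection.
        return "candidate overcorrection"
    return "other error"


print(triage_error(preliminary="entailment", final="neutral", gold="entailment"))
# prints: candidate overcorrection
```

Cases in the "other error" bucket still need a human read to separate overthinking from domain-term or statutory-interpretation mistakes.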

Core Entities

Models

  • Llama-2-13b-chat
  • PaLM-bison-chat
  • GPT-3.5-turbo
  • GPT-4

Metrics

  • Accuracy
  • micro-F1
  • macro-F1
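For readers less familiar with the two F1 variants, here is a minimal pure-Python sketch for multi-label predictions. Micro-F1 pools true positives, false positives, and false negatives across labels; macro-F1 averages per-label F1. Scoring a label with no true positives as 0.0 is a convention chosen for this example, and libraries may handle that case differently.

```python
from typing import Dict, List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]]
) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for multi-label gold/predicted label sets."""
    labels = sorted(set().union(*gold, *pred))
    counts: Dict[str, List[int]] = {lab: [0, 0, 0] for lab in labels}  # tp, fp, fn
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in p and lab in g:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    per_label = [f1(*c) for c in counts.values()]
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    return micro, macro


gold = [{"a"}, {"a"}, {"b"}]
pred = [{"a"}, {"a"}, {"a"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
# prints: micro-F1 = 0.667, macro-F1 = 0.400
```

The gap in this toy example comes from the rare label "b": macro-F1 penalizes missing it as heavily as a frequent label, while micro-F1 is dominated by the frequent label "a".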

Datasets

  • QQP
  • QNLI
  • BoolQ
  • WiC
  • BC5CDR-chem
  • DDI
  • MedNLI
  • EUR-LEX
  • LEDGAR
  • UNFAIR-ToS

Benchmarks

  • GLUE
  • SuperGLUE
  • BLUE
  • LexGLUE