Add a short self-critique and a lightweight refinement step to prompts and get measurably more honest and helpful LLM replies

June 19, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

0

Authors

Duc Hieu Ho, Chenglin Fan

Links

Abstract / PDF

Why It Matters For Business

You can improve model trustworthiness without retraining: adding short self-critique and minimal-refinement prompts reduces poor outputs and raises the share of helpful, honest answers, at the cost of extra inference passes (added latency and compute).

Summary TLDR

The paper evaluates ten popular LLMs on the HONESET honesty dataset and introduces "self-critique-guided curiosity refinement": a two-step in-context prompting add-on that asks the model to (1) critique its optimized answer and (2) make minimal edits to fix flaws. Using GPT-4o as an automated judge, curiosity-driven prompting already raised honesty and H2 (honesty+helpfulness) scores across all models. Adding the self-critique+refine steps further reduced poor responses and increased excellent responses, yielding 1.4%–4.3% relative H2 gains over curiosity-driven prompting on HONESET. The method requires no fine-tuning but adds inference latency.

Problem Statement

Can in-context self-critique and a small refinement step improve an LLM's honesty and helpfulness without retraining, and how do ten widely used models behave under raw, curiosity-driven, and critique-guided refinement prompting on the HONESET honesty dataset?

Main Contribution

Benchmark: systematic in-context evaluation of 10 popular LLMs (OpenAI, Google, Meta) on HONESET using raw, curiosity-driven, and refinement prompts.

Method: self-critique-guided curiosity refinement adds two in-context steps (structured critique + minimal edits) to curiosity-driven prompting; no training is needed.
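The flow described above (curiosity-driven answer, then critique, then minimal edits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the `LLM` callable type, and the function name `refine_with_self_critique` are all hypothetical, and any chat-completion client could stand in for `llm`.

```python
from typing import Callable, Dict

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]

# Illustrative templates only; the paper's actual prompt wording is not reproduced here.
CURIOSITY_PROMPT = (
    "Question: {question}\n"
    "First state what you are unsure about or cannot know, "
    "then answer honestly and helpfully in light of that."
)
CRITIQUE_PROMPT = (
    "Question: {question}\nAnswer: {answer}\n"
    "Critique this answer: list concrete flaws in honesty or helpfulness."
)
REFINE_PROMPT = (
    "Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
    "Make the minimal edits needed to fix the flaws. Return only the revised answer."
)


def refine_with_self_critique(llm: LLM, question: str) -> Dict[str, str]:
    """Run one curiosity-driven pass, one critique pass, and one refine pass."""
    draft = llm(CURIOSITY_PROMPT.format(question=question))
    critique = llm(CRITIQUE_PROMPT.format(question=question, answer=draft))
    final = llm(
        REFINE_PROMPT.format(question=question, answer=draft, critique=critique)
    )
    # Keep the intermediate outputs so they can be logged or judged separately.
    return {"draft": draft, "critique": critique, "final": final}
```

Each question costs three model calls instead of one, which is where the added latency comes from.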

Results: across models, refinement reduced poor responses, increased excellent responses, and improved H2 scores by 1.4%–4.3% over curiosity-driven prompting.

Key Findings

Curiosity-driven prompting raised purely honest rates across all ten models.

Numbers: e.g., GPT-4o 67.1% → 96.6% (Table 1)

Self-critique-guided refinement further reduces poor responses and increases excellent responses.

Numbers: GPT-4o poor responses 20 → 0; Llama 3 8B poor responses 176 → 28 (Table 4)

Refinement boosts H2 (honesty+helpfulness) mean scores over curiosity-driven prompting by small but consistent margins.

Numbers: relative gains range 1.4%–4.3% across models (Table 6)

Results

Purely honest rate (example)

Value: GPT-4o, 67.1% (raw) → 96.6% (curiosity-driven)

Baseline: raw prompting

H2 mean score (curiosity-driven → refinement)

Value: GPT-4o, 8.627 → 8.748

Baseline: curiosity-driven prompting

H2 relative gain range (refinement vs curiosity)

Value: 1.4%–4.3% across models

Baseline: curiosity-driven prompting
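As a quick check on how the relative gain relates to the mean scores, the GPT-4o H2 figures reported above can be plugged into the usual relative-change formula. The helper name `relative_gain` is ours, and the formula is assumed to be the one behind the reported range.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# GPT-4o H2 mean: curiosity-driven prompting vs. refinement.
gpt4o_gain = relative_gain(8.627, 8.748)
print(f"{gpt4o_gain:.1f}%")  # prints 1.4%
```

The result sits at the lower end of the reported 1.4%–4.3% range.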

Who Should Care

What To Try In 7 Days

Run a pilot over your most-used prompts: add curiosity-driven substeps (ask the model what it lacks), then a self-critique and minimal-edit refinement pass.

Use a strong internal or public LLM as an automated judge to measure honest rate and H2 before and after on 200 representative queries.
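One way to tabulate such a before/after pilot is sketched below. It assumes the judge has already returned, for each reply, an honesty verdict and an H2 score; the `Judgement` record, the 1–10 score scale, and the function names are assumptions for illustration, not the paper's evaluation code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class Judgement:
    honest: bool  # judge's verdict: was the reply purely honest?
    h2: float     # judge's honesty+helpfulness score (assumed 1-10 scale)


def summarize(judgements: Sequence[Judgement]) -> Dict[str, float]:
    """Return the honest rate and mean H2 score for one prompting condition."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return {
        "honest_rate": sum(j.honest for j in judgements) / len(judgements),
        "h2_mean": mean(j.h2 for j in judgements),
    }


def compare(before: Sequence[Judgement], after: Sequence[Judgement]) -> Dict[str, float]:
    """Return after-minus-before deltas for each summary metric."""
    b, a = summarize(before), summarize(after)
    return {name: a[name] - b[name] for name in b}
```

Judging both conditions on the same queries keeps the deltas attributable to the prompting change rather than to the query mix.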

If latency is acceptable, deploy refinement only for high-risk answers (e.g., medical, legal, or financial) to limit cost.
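A risk gate of this kind could look like the sketch below. The keyword list and the names `is_high_risk` and `route` are hypothetical placeholders; a production system would more likely use a trained classifier or existing product metadata than keyword matching.

```python
import re
from typing import Callable

# Illustrative terms only; tune or replace with a proper classifier.
HIGH_RISK_TERMS = frozenset(
    {"diagnosis", "dosage", "medication", "lawsuit", "contract", "tax", "loan", "investment"}
)


def is_high_risk(query: str) -> bool:
    """Crude keyword check for medical, legal, or financial queries."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return not words.isdisjoint(HIGH_RISK_TERMS)


def route(
    query: str,
    fast_answer: Callable[[str], str],
    refined_answer: Callable[[str], str],
) -> str:
    """Send high-risk queries through the slower refinement path, others through the fast one."""
    handler = refined_answer if is_high_risk(query) else fast_answer
    return handler(query)
```

Only the gated queries pay for the extra inference passes, so the added cost scales with the share of high-risk traffic.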

Reproducibility

Data Urls

  • HONESET (Gao et al., 2024) referenced in paper

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Extra inference passes increase latency and compute; may not fit low-latency apps (Section 5.4).
  • Evaluation uses GPT-4o as judge; judge-model bias and imperfect agreement with humans are acknowledged (Section 4.3.1).
  • The paper measures only honesty and helpfulness; other alignment axes such as harmlessness are not evaluated.

When Not To Use

  • When strict low-latency constraints make extra inference passes infeasible.
  • If you cannot accept automated-judge evaluation without human validation for high-stakes outputs.
  • When the application requires guarantees beyond honesty/helpfulness (e.g., fairness or safety audits).

Failure Modes

  • The LLM judge may mis-evaluate nuanced cases, producing misleading improvements.
  • Minimal edits in refinement may fail to fix deep factual errors or hallucinations.
  • Repeated in-context passes could amplify certain biases or sycophancy if prompts are poorly designed.

Core Entities

Models

  • GPT-4o
  • GPT-4o-mini
  • GPT-o3-mini
  • Gemini 2.0 Flash
  • Gemma 3 27B
  • Gemma 2 9B
  • Llama 3 70B
  • Llama 3 8B
  • Llama 4 Scout
  • Llama 4 Maverick

Metrics

  • H2 score (honesty + helpfulness)
  • purely honest rate
  • banded quality frequencies (poor/medium/excellent)
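Banded frequencies are simple counts of judge scores per quality band. The sketch below shows the idea; the cut-offs (scores up to 3 are poor, up to 6 medium, above that excellent) are placeholder assumptions and should be replaced with the paper's definitions before comparing numbers.

```python
from collections import Counter
from typing import Dict, Iterable

BANDS = ("poor", "medium", "excellent")


def band(score: float, poor_max: float = 3.0, medium_max: float = 6.0) -> str:
    """Map a judge score to a quality band using assumed cut-offs."""
    if score <= poor_max:
        return "poor"
    if score <= medium_max:
        return "medium"
    return "excellent"


def band_counts(scores: Iterable[float]) -> Dict[str, int]:
    """Count how many scores fall into each band, including empty bands."""
    counts = Counter(band(s) for s in scores)
    return {name: counts.get(name, 0) for name in BANDS}
```

Comparing these counts between prompting conditions shows whether gains come from fewer poor replies, more excellent ones, or both.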

Datasets

  • HONESET (Gao et al., 2024)