Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
You can improve model trustworthiness without retraining: add short self-critique and minimal-refinement prompts to reduce poor outputs and raise the share of helpful, honest answers, trading extra inference passes (added latency and compute) for better user safety and satisfaction.
Summary TLDR
The paper evaluates ten popular LLMs on the HONESET honesty dataset and introduces "self-critique-guided curiosity refinement": a two-step in-context prompting add-on that asks the model to (1) critique its optimized answer and (2) make minimal edits to fix flaws. Using GPT-4o as an automated judge, curiosity-driven prompting already raised honesty and H2 (honesty+helpfulness) scores across all models. Adding the self-critique+refine steps further reduced poor responses and increased excellent responses, yielding 1.4%–4.3% relative H2 gains over curiosity-driven prompting on HONESET. The method requires no fine-tuning but adds inference latency.
Problem Statement
Can in-context self-critique and a small refinement step improve an LLM's honesty and helpfulness without retraining, and how do ten widely used models behave under raw, curiosity-driven, and critique-guided refinement prompting on the HONESET honesty dataset?
Main Contribution
Benchmark: systematic in-context evaluation of 10 popular LLMs (OpenAI, Google, Meta) on HONESET using raw, curiosity-driven, and refinement prompts.
Method: self-critique-guided curiosity refinement, which adds two in-context steps (a structured critique, then minimal edits) on top of curiosity-driven prompting; no training is needed.
Results: across models, refinement reduced poor responses, increased excellent responses, and improved H2 scores by 1.4%–4.3% over curiosity-driven prompting.
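The two added steps can be sketched as a small wrapper around any text-in, text-out model call. This is a minimal illustration, not the paper's implementation: the prompt templates, the `llm` callable, and the function name `refine_with_self_critique` are all assumptions introduced here, and the paper's actual prompt wording is not reproduced.

```python
from typing import Callable

# Hypothetical prompt templates (assumed wording, not taken from the paper).
CRITIQUE_TEMPLATE = (
    "Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
    "Critique the draft: list concrete flaws in honesty or helpfulness."
)
REFINE_TEMPLATE = (
    "Question:\n{question}\n\nDraft answer:\n{draft}\n\n"
    "Critique:\n{critique}\n\n"
    "Make only the minimal edits needed to fix the flaws above. "
    "Return only the revised answer."
)


def refine_with_self_critique(
    question: str, draft: str, llm: Callable[[str], str]
) -> dict:
    """Run one critique pass and one minimal-edit pass over an existing draft.

    `draft` is assumed to be the answer already produced by curiosity-driven
    prompting; `llm` is any function mapping a prompt string to a completion.
    """
    # Step 1: ask the model to critique its own draft.
    critique = llm(CRITIQUE_TEMPLATE.format(question=question, draft=draft))
    # Step 2: ask the model to apply minimal edits guided by that critique.
    refined = llm(
        REFINE_TEMPLATE.format(question=question, draft=draft, critique=critique)
    )
    return {"draft": draft, "critique": critique, "refined": refined}
```

Each call to `refine_with_self_critique` issues two extra model requests per answer, which is where the added inference latency comes from.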
Key Findings
Curiosity-driven prompting raised purely honest rates across all ten models.
Self-critique-guided refinement further reduced poor responses and increased excellent responses.
Refinement raised mean H2 (honesty+helpfulness) scores over curiosity-driven prompting by small but consistent margins (1.4%–4.3% relative).
Results
Purely honest rate (example)
H2 mean score (curiosity-driven → refinement)
H2 relative gain range (refinement vs curiosity): 1.4%–4.3%
Who Should Care
What To Try In 7 Days
Run a pilot over your most-used prompts: add curiosity-driven substeps (ask the model what information it lacks), then a self-critique and minimal-edit refinement pass.
Use a strong internal or public LLM as an automated judge to measure the honest rate and H2 score before and after on 200 representative queries.
If latency is acceptable, deploy refinement only for high-risk answers (e.g., medical, legal, or financial) to limit cost.
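A before/after pilot like the one described above could be organized as follows. This is a hedged sketch: `run_pilot`, `baseline_answer`, `refined_answer`, and `judge_score` are hypothetical names, the judge is assumed to return a positive numeric score per answer, and "relative gain" is assumed to mean the difference in means divided by the baseline mean.

```python
from statistics import mean
from typing import Callable, Sequence


def run_pilot(
    queries: Sequence[str],
    baseline_answer: Callable[[str], str],
    refined_answer: Callable[[str], str],
    judge_score: Callable[[str, str], float],
) -> dict:
    """Score two prompting variants on the same queries with one judge."""
    if not queries:
        raise ValueError("queries must be non-empty")
    # Judge every (query, answer) pair for both variants.
    base = [judge_score(q, baseline_answer(q)) for q in queries]
    refined = [judge_score(q, refined_answer(q)) for q in queries]
    base_mean, refined_mean = mean(base), mean(refined)
    return {
        "baseline_mean": base_mean,
        "refined_mean": refined_mean,
        # Assumed definition of relative gain; requires a positive baseline mean.
        "relative_gain": (refined_mean - base_mean) / base_mean,
    }
```

Because the judge is itself an LLM in this setup, spot-checking a sample of its scores against human ratings is a reasonable safeguard before acting on the numbers.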
Reproducibility
Data Urls
- HONESET (Gao et al., 2024), referenced in the paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Extra inference passes increase latency and compute; may not fit low-latency apps (Section 5.4).
- Evaluation uses GPT-4o as judge; judge-model bias and imperfect agreement with humans are acknowledged (Section 4.3.1).
- The paper measures only honesty and helpfulness; other alignment axes, such as harmlessness, are not evaluated.
When Not To Use
- When strict low-latency constraints make extra inference passes infeasible.
- If you cannot accept automated-judge evaluation without human validation for high-stakes outputs.
- When the application requires guarantees beyond honesty/helpfulness (e.g., fairness or safety audits).
Failure Modes
- The LLM judge may mis-evaluate nuanced cases, producing misleading improvements.
- Minimal edits in refinement may fail to fix deep factual errors or hallucinations.
- Repeated in-context passes could amplify certain biases or sycophancy if prompts are poorly designed.
Core Entities
Models
- GPT-4o
- GPT-4o-mini
- GPT-o3-mini
- Gemini 2.0 Flash
- Gemma 3 27B
- Gemma 2 9B
- Llama 3 70B
- Llama 3 8B
- Llama 4 Scout
- Llama 4 Maverick
Metrics
- H2 score (honesty + helpfulness)
- purely honest rate
- banded quality frequencies (poor/medium/excellent)
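As an illustration of the banded quality frequencies, the sketch below buckets per-response judge scores into poor, medium, and excellent and reports the share in each band. The cutoffs used here (1 to 3 poor, 4 to 6 medium, 7 to 10 excellent on a 1 to 10 scale) are an assumption and should be checked against the paper; `band` and `band_frequencies` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable

BANDS = ("poor", "medium", "excellent")


def band(score: float) -> str:
    """Map one judge score to a quality band (assumed cutoffs, 1-10 scale)."""
    if score <= 3:
        return "poor"
    if score <= 6:
        return "medium"
    return "excellent"


def band_frequencies(scores: Iterable[float]) -> dict:
    """Return the fraction of responses falling into each band."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must be non-empty")
    counts = Counter(band(s) for s in scores)
    # Counter returns 0 for bands with no responses.
    return {name: counts[name] / len(scores) for name in BANDS}
```

Comparing these frequencies between prompting variants shows whether a gain in the mean score comes from fewer poor responses, more excellent ones, or both.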
Datasets
- HONESET (Gao et al., 2024)

