A benchmark that measures whether LLMs follow prompts or their own memory when prompts conflict with stored knowledge

Overview

Decision Snapshot: Needs Validation

The paper gives a usable dataset and clear metrics; results are practical but limited by dataset scope and partial model coverage, so apply findings cautiously in new domains.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Jiahao Ying, Yixin Cao, Kai Xiong, Yidong He, Long Cui, Yongbin Liu

Why It Matters For Business

Knowing whether a model trusts the prompt or its own memory changes how you design retrieval, prompts, and monitoring: pick high-RR models when answers must draw on fresh external data, or use instruction-tuned dependent models when you need strict prompt compliance.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors build KRE, an 11.7k-sample benchmark that tests how LLMs behave when a prompt conflicts with model memory. They define four metrics (Vulnerable Robustness VR, Resilient Robustness RR, Factual Robustness FR, and Decision-Making Style Score DMSS) and evaluate seven models (GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat). Three results stand out: models differ systematically in whether they rely on prompts (dependent) or internal memory (intuitive); role-play prompts can shift styles, but adaptivity varies widely; and hints reduce the influence of wrong prompts but raise the number of invalid answers. The dataset, evaluation pipeline, and metrics are the main deliverables.

Problem Statement

LLMs sometimes receive prompts that conflict with knowledge stored in their parameters. We lack a clear, quantitative way to (1) classify whether a model follows prompts or its own memory, and (2) measure how robust a model is when prompt and memory disagree. This gap matters in real setups such as retrieval-augmented generation (RAG), where external context can be newer than the model's memory or noisy.

Main Contribution

KRE: an 11,684-sample benchmark (from SQuAD, MuSiQue, ECQA, e-CARE) that pairs golden and misleading context for multiple-choice QA.

A robustness pipeline and four metrics: Vulnerable Robustness (VR), Resilient Robustness (RR), combined Factual Robustness (FR), and Decision-Making Style Score (DMSS).
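As a rough illustration of how these metrics relate, the sketch below computes VR, RR, and FR from per-sample outcomes. The definitions are assumptions inferred from this summary: VR as accuracy under a misleading prompt on questions the model already answers correctly, RR as accuracy under a golden prompt on questions it initially gets wrong, and FR as their unweighted mean. The paper's exact formulas may differ, and DMSS is not reproduced here. The `Outcome` type and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    """One benchmark sample (hypothetical record layout)."""
    knew_answer: bool              # correct with no context (memory check)
    correct_with_golden: bool      # correct when the prompt carries the right fact
    correct_with_misleading: bool  # correct when the prompt carries a wrong fact


def rate(flags: List[bool]) -> float:
    """Percentage of True values; 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def vulnerable_robustness(outcomes: List[Outcome]) -> float:
    # Assumed definition: among questions the model already knows,
    # how often it stays correct despite a misleading prompt.
    return rate([o.correct_with_misleading for o in outcomes if o.knew_answer])


def resilient_robustness(outcomes: List[Outcome]) -> float:
    # Assumed definition: among questions the model does not know,
    # how often a golden prompt leads it to the correct answer.
    return rate([o.correct_with_golden for o in outcomes if not o.knew_answer])


def factual_robustness(vr: float, rr: float) -> float:
    # Assumption: unweighted mean of VR and RR.
    return (vr + rr) / 2.0


# Made-up outcomes, only to exercise the functions.
sample = [
    Outcome(True, True, True),
    Outcome(True, True, False),
    Outcome(False, True, False),
    Outcome(False, False, False),
]
vr = vulnerable_robustness(sample)
rr = resilient_robustness(sample)
fr = factual_robustness(vr, rr)
print(f"VR={vr:.1f} RR={rr:.1f} FR={fr:.1f}")
```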

Key Findings

GPT-4 achieves the highest ability to use correct prompt facts (RR) and highest overall factual robustness (FR) on the KRE benchmark.

Numbers: GPT-4 VR=50, RR=81, FR≈66 (Table 9)

Practical Use: For applications that must prioritize external, up-to-date info (e.g., RAG), prefer higher-RR models like GPT-4 to better leverage accurate retrievals.

Evidence Ref: Table 9

Many instruction-tuned medium models tend to follow external prompts (dependent style); 4 out of 7 tested models are classified as dependent.

Numbers: 4/7 models show dependent style (Table 2 DMSS/Style)

Practical Use: If your deployed model is instruction-tuned and medium-sized, assume it will obey prompts more than its own memory, and design retrieval quality and prompt checks accordingly.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
VR (Vulnerable Robustness)	GPT-4 50, ChatGPT 32, Bard 54, Vicuna-13B 25, LLaMA-13B 20, LLaMA-2-13B-chat 24, Claude 34	—	—	KRE overall	Table 9 shows VR (%) per model	Table 9
RR (Resilient Robustness)	GPT-4 81, ChatGPT 79, Bard 68, Vicuna-13B 48, LLaMA-13B 21, LLaMA-2-13B-chat 62, Claude 57	—	—	KRE overall	Table 9 shows RR (%) per model	Table 9
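
To make the table easier to compare across models, the short sketch below loads the VR and RR values quoted above into a dictionary, ranks models by RR, and averages VR and RR per model. The averaging is an assumption: it is consistent with the reported GPT-4 FR≈66 (the mean of 50 and 81 is 65.5), but this summary does not confirm that FR is computed this way for every model.

```python
# VR and RR percentages as quoted in the results table (Table 9).
scores = {
    "GPT-4": {"VR": 50, "RR": 81},
    "ChatGPT": {"VR": 32, "RR": 79},
    "Bard": {"VR": 54, "RR": 68},
    "Vicuna-13B": {"VR": 25, "RR": 48},
    "LLaMA-13B": {"VR": 20, "RR": 21},
    "LLaMA-2-13B-chat": {"VR": 24, "RR": 62},
    "Claude": {"VR": 34, "RR": 57},
}

# Rank by RR (use of correct prompt facts), highest first.
by_rr = sorted(scores, key=lambda m: scores[m]["RR"], reverse=True)

# Model that best resists misleading prompts.
best_vr = max(scores, key=lambda m: scores[m]["VR"])

# Assumed combination: unweighted mean of VR and RR.
mean_vr_rr = {m: (s["VR"] + s["RR"]) / 2 for m, s in scores.items()}

for model in by_rr:
    s = scores[model]
    print(f"{model:18s} VR={s['VR']:3d} RR={s['RR']:3d} mean={mean_vr_rr[model]:.1f}")
print("Highest VR:", best_vr)
```

On these numbers GPT-4 leads on RR while Bard has the highest VR, so the better choice depends on whether ignoring a correct retrieval or accepting a misleading one is the costlier failure.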

What To Try In 7 Days

Run a quick memory assessment on your model with 200 domain QA pairs to split D+ vs D- (use zero-shot).

Measure VR/RR on a small KRE-style set: inject correct and misleading contexts and record VR/RR.

Test a 'hint' prompt (warn about misleading context) and log change in wrong vs invalid answers to decide trade-off tolerance.
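
The first two steps above could be wired together roughly as follows. This is a minimal sketch, not the authors' pipeline: `ask` is a hypothetical callable that wraps your model and returns an option letter, the `Item` fields and prompt template are illustrative, and the stub model at the bottom exists only so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Item:
    question: str
    options: Dict[str, str]   # e.g. {"A": "Paris", "B": "Lyon"}
    answer: str               # gold option letter
    golden_context: str       # passage supporting the gold answer
    misleading_context: str   # passage supporting a wrong option


def build_prompt(item: Item, context: str = "") -> str:
    """Illustrative multiple-choice template; not the paper's wording."""
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(item.options.items()))
    ctx = f"Context: {context}\n" if context else ""
    return f"{ctx}Question: {item.question}\n{opts}\nAnswer with one letter."


def split_by_memory(items: List[Item], ask: Callable[[str], str]
                    ) -> Tuple[List[Item], List[Item]]:
    """Zero-shot, no context: D+ = answered correctly, D- = not."""
    d_plus, d_minus = [], []
    for it in items:
        (d_plus if ask(build_prompt(it)) == it.answer else d_minus).append(it)
    return d_plus, d_minus


def accuracy(items: List[Item], ask: Callable[[str], str],
             context_of: Callable[[Item], str]) -> float:
    """Percentage correct when each item is asked with the chosen context."""
    if not items:
        return 0.0
    hits = sum(ask(build_prompt(it, context_of(it))) == it.answer for it in items)
    return 100.0 * hits / len(items)


def measure(items: List[Item], ask: Callable[[str], str]) -> Dict[str, float]:
    d_plus, d_minus = split_by_memory(items, ask)
    return {
        # Assumed reading: VR on D+ with misleading context, RR on D- with golden.
        "VR": accuracy(d_plus, ask, lambda it: it.misleading_context),
        "RR": accuracy(d_minus, ask, lambda it: it.golden_context),
        "n_plus": len(d_plus),
        "n_minus": len(d_minus),
    }


def stub_ask(prompt: str) -> str:
    # Toy "dependent" model: picks the option whose text appears in the
    # context line, and guesses "A" when no context is given.
    lines = prompt.splitlines()
    if lines[0].startswith("Context: "):
        context = lines[0][len("Context: "):]
        for line in lines[2:]:
            letter, _, text = line.partition(". ")
            if text and text in context:
                return letter
    return "A"


items = [
    Item("What is the capital of France?", {"A": "Paris", "B": "Lyon"}, "A",
         "The capital of France is Paris.", "The capital of France is Lyon."),
    Item("Which planet is largest?", {"A": "Mars", "B": "Jupiter"}, "B",
         "Jupiter is the largest planet.", "Mars is the largest planet."),
]
result = measure(items, stub_ask)
print(result)
```

Replacing `stub_ask` with a call to your own model, and the two toy items with the 200 domain QA pairs, gives a first estimate of both rates; with so few samples the numbers should be read as indicative only.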

Agent Features

Memory

parametric memory probing (QA zero-shot)

Tool Use

role-play instruction

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluation uses KRE and subsets; findings may not generalize beyond these tasks and domains.

Metrics focus on multiple-choice QA; outcomes may differ on free-form generation or other tasks.

When Not To Use

For non-knowledge-intensive tasks like casual dialog where prompt vs memory conflict is irrelevant.

As sole evidence for model behavior on open-ended generation or domain-specific specialty tasks.

Failure Modes

Instruction/hint reduces wrong answers but increases invalid/abstain outputs.

Few-shot examples can confuse some models and lengthen context, harming robustness.
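
One way to track the first failure mode above is to tally each answer as correct, wrong, or invalid with and without the hint, then compare the counts. The sketch below assumes answers are single option letters and treats anything else as invalid. The function names and the sample outputs are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def classify(answer: str, gold: str, valid: FrozenSet[str]) -> str:
    """Label one model output as 'correct', 'wrong', or 'invalid'."""
    letter = answer.strip().upper()
    if letter not in valid:
        return "invalid"  # refusal, abstention, or unparseable output
    return "correct" if letter == gold else "wrong"


def tally(answers: Iterable[str], golds: Iterable[str],
          valid: FrozenSet[str] = frozenset("ABCD")) -> Counter:
    """Count outcome labels over paired answers and gold letters."""
    return Counter(classify(a, g, valid) for a, g in zip(answers, golds))


def delta(before: Counter, after: Counter) -> Dict[str, int]:
    """Change in each outcome count after adding the hint."""
    return {k: after[k] - before[k] for k in ("correct", "wrong", "invalid")}


# Made-up outputs under misleading context, without and with a hint.
golds = ["A", "B", "C", "D"]
no_hint = ["A", "C", "A", "D"]
with_hint = ["A", "I am not sure", "C", "D"]

before = tally(no_hint, golds)
after = tally(with_hint, golds)
change = delta(before, after)
print(change)
```

A drop in `wrong` paired with a rise in `invalid` is the trade-off described above; whether it is acceptable depends on how your application handles an abstention.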

Core Entities

Models

GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat

Metrics

VR, RR, FR, DMSS, Adaptivity, Upper-bound

Datasets

KRE, MuSiQue, SQuAD v2.0, ECQA, e-CARE

Benchmarks

KRE
