A benchmark that measures whether LLMs follow prompts or their own memory when prompts conflict with stored knowledge

September 29, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper gives a usable dataset and clear metrics; results are practical but limited by dataset scope and partial model coverage, so apply findings cautiously in new domains.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Jiahao Ying, Yixin Cao, Kai Xiong, Yidong He, Long Cui, Yongbin Liu

Why It Matters For Business

Whether a model trusts the prompt or its own memory changes how you design retrieval, prompts, and monitoring. Pick a high-RR model when the application depends on fresh external data, and an instruction-tuned, dependent-style model when you need strict prompt compliance.

Who Should Care

Summary TLDR

The authors build KRE, an 11.7k-sample benchmark that tests how LLMs behave when a prompt conflicts with the model's memory. They define four metrics (Vulnerable Robustness, VR; Resilient Robustness, RR; Factual Robustness, FR; and Decision-Making Style Score, DMSS) and evaluate seven models (GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat). Three results stand out: models differ systematically in whether they rely on the prompt (dependent style) or on internal memory (intuitive style); role-play prompts can shift a model's style, but adaptivity varies widely; and hints reduce the influence of wrong prompts but raise the number of invalid answers. The dataset, evaluation pipeline, and metrics are the main deliverables.

Problem Statement

LLMs sometimes receive prompts that conflict with knowledge stored in their parameters. We lack a clear, quantitative way to (1) classify whether a model follows the prompt or its own memory, and (2) measure how robust a model is when prompt and memory disagree. This gap matters in practical setups such as retrieval-augmented generation (RAG), where external context can be newer than the model's memory or simply noisy.

Main Contribution

KRE: an 11,684-sample benchmark (built from SQuAD, MuSiQue, ECQA, and e-CARE) that pairs golden and misleading context for multiple-choice QA.

A robustness pipeline and four metrics: Vulnerable Robustness (VR), Resilient Robustness (RR), combined Factual Robustness (FR), and Decision-Making Style Score (DMSS).
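
As a rough illustration of how the two core metrics might be computed, the sketch below assumes that VR is the share of questions the model already answers correctly from memory that it still answers correctly under misleading context, and that RR is the share of questions it misses from memory that it answers correctly once golden context is supplied. These readings, the `Outcome` record, and the function names are assumptions made for illustration; they are not the paper's exact definitions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Outcome:
    """One evaluated question (hypothetical record layout)."""
    correct_from_memory: bool   # zero-shot answer without context was right
    context_kind: str           # "misleading" or "golden"
    correct_with_context: bool  # answer after the context was injected was right


def rate(hits: int, total: int) -> float:
    """Percentage of hits, or NaN when the pool is empty."""
    return 100.0 * hits / total if total else float("nan")


def vr_rr(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Return (VR, RR) in percent under the assumptions stated above."""
    outcomes = list(outcomes)
    # Assumed VR pool: the model knew the answer and was shown misleading context.
    vr_pool = [o for o in outcomes
               if o.correct_from_memory and o.context_kind == "misleading"]
    # Assumed RR pool: the model did not know the answer and was shown golden context.
    rr_pool = [o for o in outcomes
               if not o.correct_from_memory and o.context_kind == "golden"]
    vr = rate(sum(o.correct_with_context for o in vr_pool), len(vr_pool))
    rr = rate(sum(o.correct_with_context for o in rr_pool), len(rr_pool))
    return vr, rr
```

Keeping the two pools separate means a model cannot raise one score by sacrificing the other: VR only counts cases where it had something correct to defend, and RR only counts cases where the context had something to teach it.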

Key Findings

GPT-4 achieves the highest ability to use correct prompt facts (RR) and highest overall factual robustness (FR) on the KRE benchmark.

Numbers: GPT-4: VR=50, RR=81, FR≈66 (Table 9)

Practical Use: For applications that must prioritize external, up-to-date info (e.g., RAG), prefer higher-RR models like GPT-4 to better leverage accurate retrievals.

Evidence Ref: Table 9

Many instruction-tuned medium models tend to follow external prompts (dependent style); 4 out of 7 tested models are classified as dependent.

Numbers: 4/7 models show dependent style (Table 2 DMSS/Style)

Practical Use: If your deployed model is instruction-tuned and medium-sized, assume it will obey prompts more than its own memory, and design retrieval quality and prompt checks accordingly.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
VR (Vulnerable Robustness) | GPT-4 50, ChatGPT 32, Bard 54, Vicuna-13B 25, LLaMA-13B 20, LLaMA-2-13B-chat 24, Claude 34 | - | - | KRE overall | Table 9 shows VR (%) per model | Table 9
RR (Resilient Robustness) | GPT-4 81, ChatGPT 79, Bard 68, Vicuna-13B 48, LLaMA-13B 21, LLaMA-2-13B-chat 62, Claude 57 | - | - | KRE overall | Table 9 shows RR (%) per model | Table 9
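
For readers who want to compare models directly, the short sketch below loads the VR and RR values reported above and ranks the models. The simple mean of VR and RR is included only as a convenience summary: for GPT-4 it gives 65.5, which is close to the reported FR≈66, but whether FR is defined this way is an assumption and is not confirmed by this summary.

```python
# VR and RR percentages as reported for KRE overall (Table 9).
VR = {"GPT-4": 50, "ChatGPT": 32, "Bard": 54, "Vicuna-13B": 25,
      "LLaMA-13B": 20, "LLaMA-2-13B-chat": 24, "Claude": 34}
RR = {"GPT-4": 81, "ChatGPT": 79, "Bard": 68, "Vicuna-13B": 48,
      "LLaMA-13B": 21, "LLaMA-2-13B-chat": 62, "Claude": 57}

# Models ordered from best to worst on each metric.
by_rr = sorted(RR, key=RR.get, reverse=True)
by_vr = sorted(VR, key=VR.get, reverse=True)

# Convenience summary only: NOT necessarily the paper's FR definition.
mean_vr_rr = {model: (VR[model] + RR[model]) / 2 for model in VR}

if __name__ == "__main__":
    for model in by_rr:
        print(f"{model:18s} VR={VR[model]:3d}  RR={RR[model]:3d}  "
              f"mean={mean_vr_rr[model]:.1f}")
```

On these numbers GPT-4 leads on RR while Bard leads on VR, so the choice of model depends on which failure is costlier in your setting: ignoring good context or being swayed by bad context.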

What To Try In 7 Days

Run a quick memory assessment: ask your model 200 domain QA pairs zero-shot, without context, and split them into D+ and D- (questions it does and does not answer correctly from memory).

Measure VR and RR on a small KRE-style set by injecting correct and misleading contexts into those questions.

Test a 'hint' prompt (warn about misleading context) and log change in wrong vs invalid answers to decide trade-off tolerance.
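
The first two steps above can be sketched as follows. The code is a minimal illustration, assuming each QA item is a dict with `question`, `options`, and a letter `answer`, and that `ask` is a callable you supply that sends a prompt to your model and returns its reply. The item layout, the prompt wording, and both function names are hypothetical, and treating D+ as "answered correctly without context" is this sketch's reading of the split.

```python
from typing import Callable, Dict, List, Tuple


def format_question(item: Dict, context: str = "") -> str:
    """Build a multiple-choice prompt, optionally prefixed with injected context."""
    lines = []
    if context:
        lines.append(f"Context: {context}")
    lines.append(f"Question: {item['question']}")
    for letter, option in zip("ABCDEFGH", item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def split_by_memory(
    items: List[Dict], ask: Callable[[str], str]
) -> Tuple[List[Dict], List[Dict]]:
    """Zero-shot memory assessment: return (D+, D-) under the assumed reading."""
    d_plus, d_minus = [], []
    for item in items:
        # No context is injected here, so the reply reflects memory only.
        reply = ask(format_question(item)).strip().upper()
        if reply[:1] == item["answer"]:
            d_plus.append(item)
        else:
            d_minus.append(item)
    return d_plus, d_minus
```

Once the split exists, calling `format_question(item, context=...)` with a misleading passage for D+ items and a correct passage for D- items yields the prompts needed for the VR and RR measurements.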

Agent Features

Memory
parametric memory probing (QA zero-shot)
Tool Use
role-play instruction

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluation uses KRE and subsets; findings may not generalize beyond these tasks and domains.

Metrics focus on multiple-choice QA; outcomes may differ on free-form generation or other tasks.

When Not To Use

For non-knowledge-intensive tasks like casual dialog where prompt vs memory conflict is irrelevant.

As sole evidence for model behavior on open-ended generation or domain-specific specialty tasks.

Failure Modes

Instruction/hint reduces wrong answers but increases invalid/abstain outputs.

Few-shot examples can confuse some models and lengthen context, harming robustness.

Core Entities

Models

GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat

Metrics

VR, RR, FR, DMSS, Adaptivity, Upper-bound

Datasets

KRE, MuSiQue, SQuAD v2.0, ECQA, e-CARE

Benchmarks

KRE