A benchmark that measures whether LLMs follow prompts or their own memory when prompts conflict with stored knowledge

September 29, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

3

Authors

Jiahao Ying, Yixin Cao, Kai Xiong, Yidong He, Long Cui, Yongbin Liu

Links

Abstract / PDF

Why It Matters For Business

Knowing whether a model trusts the prompt or its own memory changes how you design retrieval, prompts, and monitoring: pick high-RR models when the system must use fresh external data, and pick instruction-tuned, dependent-style models when you need strict prompt compliance.

Summary TLDR

The authors build KRE, an 11.7k-sample benchmark that tests how LLMs behave when a prompt conflicts with model memory. They define four metrics (Vulnerable Robustness VR, Resilient Robustness RR, Factual Robustness FR, and Decision-Making Style Score DMSS) and evaluate seven models (GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat). Three results stand out: models differ systematically in whether they rely on prompts (dependent) or internal memory (intuitive); role-play prompts can shift styles, but adaptivity varies widely; and hints reduce the influence of misleading prompts but raise the number of invalid answers. The dataset, evaluation pipeline, and metrics are the main deliverables.

Problem Statement

LLMs sometimes receive prompts that conflict with knowledge stored in their parameters. We lack a clear, quantitative way to (1) classify whether a model follows prompts or its own memory, and (2) measure how robust a model is when prompt and memory disagree. This gap matters in real setups such as retrieval-augmented generation (RAG), where external context can be newer than the model's memory or simply noisy.

Main Contribution

KRE: an 11,684-sample benchmark (from SQuAD, MuSiQue, ECQA, e-CARE) that pairs golden and misleading context for multiple-choice QA.

A robustness pipeline and four metrics: Vulnerable Robustness (VR), Resilient Robustness (RR), combined Factual Robustness (FR), and Decision-Making Style Score (DMSS).

Large-scale evaluation of seven LLMs showing systematic styles (dependent, intuitive, rational) and that role-play instructions can change behavior with different adaptivity limits.
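To make the benchmark format concrete, here is a minimal sketch of what one KRE-style item and its prompt could look like. The class name `ConflictSample`, its fields, and the prompt layout in `build_prompt` are illustrative assumptions, not the authors' actual schema or template.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConflictSample:
    """Hypothetical container for one multiple-choice item with two contexts."""
    question: str
    options: List[str]
    answer_index: int          # index of the correct option
    golden_context: str        # context that supports the correct answer
    misleading_context: str    # context that supports a wrong answer


def build_prompt(sample: ConflictSample, context: Optional[str] = None) -> str:
    """Render a multiple-choice prompt, optionally preceded by a context."""
    lines = []
    if context:
        lines.append(f"Context: {context}")
    lines.append(f"Question: {sample.question}")
    for i, option in enumerate(sample.options):
        # Label options A, B, C, ...
        lines.append(f"{chr(ord('A') + i)}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)
```

Calling `build_prompt(sample)` with no context gives the closed-book prompt, while passing `sample.golden_context` or `sample.misleading_context` gives the two conflict conditions.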

Key Findings

GPT-4 achieves the highest ability to use correct prompt facts (RR) and highest overall factual robustness (FR) on the KRE benchmark.

Numbers: GPT-4: VR=50, RR=81, FR≈66 (Table 9)

Many instruction-tuned medium models tend to follow external prompts (dependent style); 4 out of 7 tested models are classified as dependent.

Numbers: 4/7 models show dependent style (Table 2 DMSS/Style)

Hints that warn about misleading context reduce the number of times models pick the misleading answer but increase invalid or 'I don't know' outputs.

Numbers: ChatGPT: misleading answers 3902→3638 (−264); invalid outputs 637→892 (+255) with hint (Table 8)

Models are generally better at extracting factual knowledge from prompts than commonsense knowledge; RR is higher on factual datasets than on commonsense datasets.

Role-play instructions can change a model's decision style, but adaptivity (how much a model can be flipped) varies widely across models.

Numbers: GPT-4 adaptivity ≈0.8 vs LLaMA-2 adaptivity ≈0.31 (Table 2)

Few-shot examples do not reliably improve robustness under conflict; 'all-positive' few-shot often yields the highest RR and VR among few-shot settings but does not always beat zero-shot.

Results

VR (Vulnerable Robustness)

Value: GPT-4 50, ChatGPT 32, Bard 54, Vicuna-13B 25, LLaMA-13B 20, LLaMA-2-13B-chat 24, Claude 34

RR (Resilient Robustness)

Value: GPT-4 81, ChatGPT 79, Bard 68, Vicuna-13B 48, LLaMA-13B 21, LLaMA-2-13B-chat 62, Claude 57

FR (Factual Robustness)

Value: GPT-4 ≈66, ChatGPT 56, Bard 61, Vicuna-13B 36, LLaMA-13B 20, LLaMA-2-13B-chat 39, Claude 45

Accuracy

Value: ChatGPT: KRE overall 81.5%; Vicuna-13B: 70.1% (Table 1 aggregated)

Hint effect on outputs (counts)

Value: ChatGPT misleading answers 3902→3638; invalids 637→892 with hint
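The hint trade-off is easiest to read as two deltas side by side. The sketch below only does arithmetic on the ChatGPT counts reported above; the function name `hint_tradeoff` and the `net_delta` field are illustrative, not metrics from the paper.

```python
def hint_tradeoff(misleading_before, misleading_after, invalid_before, invalid_after):
    """Return the change in misleading and invalid answer counts after adding a hint."""
    misleading_delta = misleading_after - misleading_before
    invalid_delta = invalid_after - invalid_before
    return {
        "misleading_delta": misleading_delta,
        "invalid_delta": invalid_delta,
        # Change in the sum of the two reported categories (illustrative only).
        "net_delta": misleading_delta + invalid_delta,
    }


# ChatGPT counts as reported in the summary (without hint -> with hint).
chatgpt = hint_tradeoff(3902, 3638, 637, 892)
print(chatgpt)  # {'misleading_delta': -264, 'invalid_delta': 255, 'net_delta': -9}
```

On these counts, the sum of the two categories moves by only 9, which suggests that most of the avoided misleading answers turned into invalid outputs rather than correct ones.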

Who Should Care

What To Try In 7 Days

Run a quick zero-shot memory assessment on your model with 200 domain QA pairs, and split them into D+ (questions the model answers correctly without context) and D- (questions it does not).

Build a small KRE-style set by pairing each question with a correct and a misleading context, then measure VR and RR on it.

Test a 'hint' prompt (warn about misleading context) and log change in wrong vs invalid answers to decide trade-off tolerance.
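The first two steps above can be sketched as follows. This is a rough sketch under one assumption that the summary does not spell out: VR is taken as the share of D+ items still answered correctly when given misleading context, and RR as the share of D- items answered correctly when given correct context. The `Outcome` record and the `robustness` function are hypothetical names, and the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Outcome:
    """Hypothetical per-question result from three runs of the same model."""
    known: bool                     # correct in zero-shot, no context (D+ if True)
    correct_with_misleading: bool   # correct when given misleading context
    correct_with_golden: bool       # correct when given correct context


def robustness(outcomes: Iterable[Outcome]) -> Tuple[Optional[float], Optional[float]]:
    """Return (VR, RR) as fractions, or None where the subset is empty."""
    outcomes = list(outcomes)
    d_plus = [o for o in outcomes if o.known]
    d_minus = [o for o in outcomes if not o.known]
    # Assumed VR: D+ items that survive misleading context.
    vr = sum(o.correct_with_misleading for o in d_plus) / len(d_plus) if d_plus else None
    # Assumed RR: D- items rescued by correct context.
    rr = sum(o.correct_with_golden for o in d_minus) / len(d_minus) if d_minus else None
    return vr, rr
```

With 200 QA pairs, either subset can be small, so the resulting VR and RR are rough indicators rather than precise estimates.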

Agent Features

Memory

  • parametric memory probing (QA zero-shot)

Tool Use

  • role-play instruction

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation uses KRE and subsets; findings may not generalize beyond these tasks and domains.
  • Metrics focus on multiple-choice QA; outcomes may differ on free-form generation or other tasks.
  • Negative contexts and some options were generated by ChatGPT; generation could bias difficulty.
  • Not all models were exhaustively tested on the full dataset due to compute limits.

When Not To Use

  • For non-knowledge-intensive tasks like casual dialog where prompt vs memory conflict is irrelevant.
  • As sole evidence for model behavior on open-ended generation or domain-specific specialty tasks.

Failure Modes

  • Hint instructions reduce misleading answers but increase invalid or abstaining outputs.
  • Few-shot examples can confuse some models and lengthen context, harming robustness.
  • Dataset generation via a model (ChatGPT) may introduce artifacts that change difficulty.

Core Entities

Models

  • GPT-4
  • ChatGPT
  • Claude
  • Bard
  • Vicuna-13B
  • LLaMA-13B
  • LLaMA-2-13B-chat

Metrics

  • VR
  • RR
  • FR
  • DMSS
  • Adaptivity
  • Upper-bound

Datasets

  • KRE
  • MuSiQue
  • SQuAD v2.0
  • ECQA
  • e-CARE

Benchmarks

  • KRE