Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
3
Why It Matters For Business
Knowing whether a model trusts the prompt or its own memory changes how you design retrieval, prompts, and monitoring: pick high-RR models when answers must draw on fresh external data, and pick instruction-tuned, dependent-style models when you need strict prompt compliance.
Summary TLDR
The authors build KRE, an 11.7k-sample benchmark that tests how LLMs behave when a prompt conflicts with model memory. They define four metrics (Vulnerable Robustness VR, Resilient Robustness RR, Factual Robustness FR, and Decision-Making Style Score DMSS) and evaluate seven models (GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat). Three results stand out: models differ systematically in whether they rely on prompts (dependent) or internal memory (intuitive); role-play prompts can shift these styles, but adaptivity varies widely; and hints reduce the influence of misleading prompts but increase invalid answers. The dataset, evaluation pipeline, and metrics are the main deliverables.
Problem Statement
LLMs sometimes receive prompts that conflict with knowledge stored in their parameters. We lack a clear, quantitative way to (1) classify whether a model follows prompts or its own memory, and (2) measure how robust a model is when prompt and memory disagree. This gap matters in real deployments such as retrieval-augmented generation (RAG), where external context can be newer than the model's memory or noisy.
Main Contribution
KRE: an 11,684-sample benchmark (from SQuAD, MuSiQue, ECQA, e-CARE) that pairs golden and misleading context for multiple-choice QA.
A robustness pipeline and four metrics: Vulnerable Robustness (VR), Resilient Robustness (RR), combined Factual Robustness (FR), and Decision-Making Style Score (DMSS).
Large-scale evaluation of seven LLMs showing systematic styles (dependent, intuitive, rational) and that role-play instructions can change behavior, though how far each model adapts differs.
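To make the golden/misleading pairing concrete, here is a minimal Python sketch of what a KRE-style multiple-choice item and its prompt might look like. The class name ConflictSample, the function build_prompt, the field names, and the prompt wording are all hypothetical and are not taken from the KRE release; the sample itself is a made-up toy item.

```python
from dataclasses import dataclass


@dataclass
class ConflictSample:
    """Hypothetical layout for one KRE-style item (not the official schema)."""
    question: str
    options: list[str]
    answer_idx: int            # index of the correct option
    golden_context: str        # context that supports the correct option
    misleading_context: str    # context that supports a wrong option


def build_prompt(sample: ConflictSample, context_kind: str = "none") -> str:
    """Render a multiple-choice prompt with no, golden, or misleading context."""
    lines = []
    if context_kind == "golden":
        lines.append(f"Context: {sample.golden_context}")
    elif context_kind == "misleading":
        lines.append(f"Context: {sample.misleading_context}")
    elif context_kind != "none":
        raise ValueError(f"unknown context_kind: {context_kind!r}")
    lines.append(f"Question: {sample.question}")
    for letter, option in zip("ABCDEFGH", sample.options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


# Toy item for illustration only; not drawn from the benchmark.
toy = ConflictSample(
    question="What is the capital of France?",
    options=["Paris", "Lyon"],
    answer_idx=0,
    golden_context="Paris is the capital of France.",
    misleading_context="Lyon is the capital of France.",
)
prompt_misleading = build_prompt(toy, "misleading")
prompt_plain = build_prompt(toy, "none")
```

Rendering the same item three ways (no context, golden, misleading) is what lets the later metrics compare memory-only answers against answers given under conflict.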
Key Findings
GPT-4 achieves the highest ability to use correct prompt facts (RR) and highest overall factual robustness (FR) on the KRE benchmark.
Many instruction-tuned medium models tend to follow external prompts (dependent style); 4 out of 7 tested models are classified as dependent.
Hints that warn about misleading context reduce the number of times models pick the misleading answer but increase invalid or 'I don't know' outputs.
Models generally extract factual knowledge from prompts better than commonsense knowledge; RR is higher on factual datasets than on commonsense datasets.
Role-play instructions can change a model's decision style, but adaptivity (how much a model can be flipped) varies widely across models.
Few-shot examples do not reliably improve robustness under conflict; the 'all-positive' few-shot setting often yields the highest RR and VR among few-shot settings but does not always beat zero-shot.
Results
VR (Vulnerable Robustness)
RR (Resilient Robustness)
FR (Factual Robustness)
Accuracy
Hint effect on outputs (counts)
Who Should Care
What To Try In 7 Days
Run a quick zero-shot memory assessment on your model with 200 domain QA pairs to split them into D+ (items the model answers correctly from memory) and D- (items it does not).
Build a small KRE-style set, inject correct and misleading contexts, and record VR and RR.
Test a 'hint' prompt (warn about misleading context) and log change in wrong vs invalid answers to decide trade-off tolerance.
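The first two steps can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' pipeline: it assumes D+ holds the items the model answers correctly zero-shot and D- the rest, that VR is the share of D+ items still answered correctly under misleading context, and that RR is the share of D- items answered correctly once golden context is supplied. The summary does not give exact formulas, so check them against the paper before relying on the numbers. The record keys and function names are hypothetical, and the records are made-up toy data.

```python
def split_by_memory(records):
    """Split records into D+ (zero-shot correct) and D- (zero-shot wrong)."""
    d_plus = [r for r in records if r["zero_shot"] == r["gold"]]
    d_minus = [r for r in records if r["zero_shot"] != r["gold"]]
    return d_plus, d_minus


def correct_rate(items, key):
    """Fraction of items whose answer under `key` matches gold; None if empty."""
    if not items:
        return None
    return sum(r[key] == r["gold"] for r in items) / len(items)


def vr_rr(records):
    """Assumed definitions: VR on D+ with misleading context, RR on D- with golden."""
    d_plus, d_minus = split_by_memory(records)
    vr = correct_rate(d_plus, "with_misleading")
    rr = correct_rate(d_minus, "with_golden")
    return vr, rr


# Toy answers (option letters) for five items; illustration only.
records = [
    {"gold": "A", "zero_shot": "A", "with_golden": "A", "with_misleading": "A"},
    {"gold": "B", "zero_shot": "B", "with_golden": "B", "with_misleading": "C"},
    {"gold": "A", "zero_shot": "A", "with_golden": "A", "with_misleading": "A"},
    {"gold": "C", "zero_shot": "A", "with_golden": "C", "with_misleading": "A"},
    {"gold": "D", "zero_shot": "B", "with_golden": "B", "with_misleading": "B"},
]
vr, rr = vr_rr(records)
print(f"VR={vr:.2f} RR={rr:.2f}")
```

In this toy run three items land in D+ and two in D-. FR and DMSS are left out on purpose because the summary does not say how they are combined.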
Agent Features
Memory
- parametric memory probing (QA zero-shot)
Tool Use
- role-play instruction
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluation uses KRE and subsets; findings may not generalize beyond these tasks and domains.
- Metrics focus on multiple-choice QA; outcomes may differ on free-form generation or other tasks.
- Negative contexts and some options were generated by ChatGPT; generation could bias difficulty.
- Not all models were exhaustively tested on the full dataset due to compute limits.
When Not To Use
- For non-knowledge-intensive tasks like casual dialog where prompt vs memory conflict is irrelevant.
- As sole evidence for model behavior on open-ended generation or domain-specific specialty tasks.
Failure Modes
- Instruction/hint reduces wrong answers but increases invalid/abstain outputs.
- Few-shot examples can confuse some models and lengthen context, harming robustness.
- Dataset generation via a model (ChatGPT) may introduce artifacts that change difficulty.
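The hint trade-off in the first failure mode is easy to track if every model output is bucketed as correct, misled, otherwise wrong, or invalid. The Python sketch below shows one way to do that; the bucket names, the four-letter option set, and the functions classify and tally are assumptions for illustration, and the predictions are made-up toy outputs, not results from the paper.

```python
from collections import Counter

VALID_LETTERS = set("ABCD")  # assumed option labels for a 4-way item


def classify(pred, gold, misleading):
    """Bucket one raw model output against the gold and misleading options."""
    pred = pred.strip().upper()
    if pred not in VALID_LETTERS:
        return "invalid"        # abstentions such as "I don't know" land here
    if pred == gold:
        return "correct"
    if pred == misleading:
        return "misled"         # picked the option the bad context pushed
    return "other_wrong"


def tally(preds, golds, misleads):
    """Count outcome buckets over a batch of outputs."""
    return Counter(classify(p, g, m) for p, g, m in zip(preds, golds, misleads))


# Toy data for illustration only.
golds = ["A", "B", "C", "D"]
misleads = ["B", "C", "D", "A"]
no_hint = ["B", "C", "C", "A"]
with_hint = ["A", "I don't know", "C", "A"]

before = tally(no_hint, golds, misleads)
after = tally(with_hint, golds, misleads)
print("no hint:  ", dict(before))
print("with hint:", dict(after))
```

Comparing the "misled" and "invalid" counts before and after adding the hint gives a direct view of how much abstention you are trading for fewer wrong answers.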
Core Entities
Models
- GPT-4
- ChatGPT
- Claude
- Bard
- Vicuna-13B
- LLaMA-13B
- LLaMA-2-13B-chat
Metrics
- VR
- RR
- FR
- DMSS
- Adaptivity
- Upper-bound
Datasets
- KRE
- MuSiQue
- SQuAD v2.0
- ECQA
- e-CARE
Benchmarks
- KRE

