Overview
The paper gives a usable dataset and clear metrics; results are practical but limited by dataset scope and partial model coverage, so apply findings cautiously in new domains.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Knowing whether a model trusts the prompt or its own memory changes how you design retrieval, prompts, and monitoring. Pick high-RR models when answers must reflect fresh external data, and pick instruction-tuned, dependent-style models when you need strict prompt compliance.
Summary TLDR
The authors build KRE, an 11.7k-sample benchmark that tests how LLMs behave when a prompt conflicts with the model's memory. They define four metrics (Vulnerable Robustness, VR; Resilient Robustness, RR; Factual Robustness, FR; and Decision-Making Style Score, DMSS) and evaluate seven models (GPT-4, ChatGPT, Claude, Bard, Vicuna-13B, LLaMA-13B, LLaMA-2-13B-chat). Three results stand out: models differ systematically in whether they rely on the prompt (dependent) or on internal memory (intuitive); role-play prompts can shift a model's style, but adaptivity varies widely; and hints reduce the influence of wrong prompts but raise the rate of invalid answers. The dataset, evaluation pipeline, and metrics are the main deliverables.
Problem Statement
LLMs sometimes receive prompts that conflict with knowledge stored in their parameters. We lack a clear, quantitative way to (1) classify whether a model follows the prompt or its own memory, and (2) measure how robust a model is when prompt and memory disagree. This gap matters in real setups such as retrieval-augmented generation (RAG), where external context can be newer than the model's memory or noisy.
Main Contribution
KRE: an 11,684-sample benchmark (from SQuAD, MuSiQue, ECQA, e-CARE) that pairs golden and misleading contexts for multiple-choice QA.
A robustness pipeline and four metrics: Vulnerable Robustness (VR), Resilient Robustness (RR), combined Factual Robustness (FR), and Decision-Making Style Score (DMSS).
Key Findings
GPT-4 is best at using correct facts from the prompt (highest RR) and has the highest overall factual robustness (FR) on the KRE benchmark.
Many instruction-tuned, medium-sized models tend to follow external prompts (dependent style); 4 of the 7 tested models are classified as dependent.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| VR (Vulnerable Robustness) | GPT-4 50, ChatGPT 32, Bard 54, Vicuna-13B 25, LLaMA-13B 20, LLaMA-2-13B-chat 24, Claude 34 | — | — | KRE overall | Table 9 shows VR (%) per model | Table 9 |
| RR (Resilient Robustness) | GPT-4 81, ChatGPT 79, Bard 68, Vicuna-13B 48, LLaMA-13B 21, LLaMA-2-13B-chat 62, Claude 57 | — | — | KRE overall | Table 9 shows RR (%) per model | Table 9 |
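The per-model values in the table are easier to compare once they are split out. The following is a minimal sketch that copies the VR and RR percentages from the two rows above and ranks the models. The `gap` column (RR minus VR) is only a descriptive convenience introduced here; it is not a metric defined by the paper.

```python
# VR and RR percentages copied from the Results table (Table 9, KRE overall).
vr = {"GPT-4": 50, "ChatGPT": 32, "Bard": 54, "Vicuna-13B": 25,
      "LLaMA-13B": 20, "LLaMA-2-13B-chat": 24, "Claude": 34}
rr = {"GPT-4": 81, "ChatGPT": 79, "Bard": 68, "Vicuna-13B": 48,
      "LLaMA-13B": 21, "LLaMA-2-13B-chat": 62, "Claude": 57}


def rank(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


rr_ranking = rank(rr)
vr_ranking = rank(vr)

# Hypothetical helper column: RR minus VR in percentage points (not from the paper).
gap = {model: rr[model] - vr[model] for model in vr}

for model in rr_ranking:
    print(f"{model:18s} VR={vr[model]:3d} RR={rr[model]:3d} gap={gap[model]:+d}")
```

Sorting by RR puts GPT-4 first, which matches the first key finding; sorting by VR puts Bard first.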
What To Try In 7 Days
Run a quick zero-shot memory assessment on your model with 200 domain QA pairs to split them into D+ (answered correctly from memory) and D- (answered incorrectly).
Measure VR and RR on a small KRE-style set by injecting correct and misleading contexts.
Test a 'hint' prompt that warns about misleading context, and log the change in wrong versus invalid answers to decide how much of that trade-off you can tolerate.
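The first two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's pipeline. It assumes D+ holds the items the model answers correctly zero-shot and D- the rest, that VR is the share of D+ items still answered correctly when a misleading context is injected, and that RR is the share of D- items answered correctly when the golden context is injected. The item keys, `build_prompt`, and `toy_dependent_model` are hypothetical names; replace the toy function with a call to your own model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical item layout: question, options, gold, golden_context, misleading_context.
Item = Dict[str, str]
AskFn = Callable[[str], str]  # prompt in, option letter out


def build_prompt(item: Item, context: Optional[str] = None) -> str:
    """Zero-shot multiple-choice prompt, optionally preceded by a context passage."""
    parts = []
    if context is not None:
        parts.append(f"Context: {context}")
    parts.append(f"Question: {item['question']}")
    parts.append(f"Options: {item['options']}")
    parts.append("Answer with one option letter.")
    return "\n".join(parts)


def split_by_memory(items: List[Item], ask: AskFn) -> Tuple[List[Item], List[Item]]:
    """Assumed memory assessment: D+ = correct without context, D- = the rest."""
    d_plus, d_minus = [], []
    for item in items:
        target = d_plus if ask(build_prompt(item)) == item["gold"] else d_minus
        target.append(item)
    return d_plus, d_minus


def accuracy_with_context(items: List[Item], ask: AskFn, key: str) -> float:
    """Share of items answered correctly when the context stored under `key` is injected."""
    if not items:
        return float("nan")
    hits = sum(ask(build_prompt(it, it[key])) == it["gold"] for it in items)
    return hits / len(items)


def evaluate(items: List[Item], ask: AskFn) -> Dict[str, float]:
    d_plus, d_minus = split_by_memory(items, ask)
    return {
        "n_plus": len(d_plus),
        "n_minus": len(d_minus),
        "VR": accuracy_with_context(d_plus, ask, "misleading_context"),  # assumed definition
        "RR": accuracy_with_context(d_minus, ask, "golden_context"),     # assumed definition
    }


def toy_dependent_model(prompt: str) -> str:
    """Stand-in model: copies the letter named in the context, otherwise answers 'A'."""
    for line in prompt.splitlines():
        if line.startswith("Context:"):
            return line.strip()[-1]
    return "A"


# Toy items, invented purely to exercise the code.
toy_items = [
    {"question": "q1", "options": "A/B/C", "gold": "A",
     "golden_context": "The answer is A", "misleading_context": "The answer is B"},
    {"question": "q2", "options": "A/B/C", "gold": "B",
     "golden_context": "The answer is B", "misleading_context": "The answer is C"},
    {"question": "q3", "options": "A/B/C", "gold": "A",
     "golden_context": "The answer is A", "misleading_context": "The answer is C"},
    {"question": "q4", "options": "A/B/C", "gold": "C",
     "golden_context": "The answer is C", "misleading_context": "The answer is A"},
]

scores = evaluate(toy_items, toy_dependent_model)
print(scores)
```

The toy model always follows the context, so it scores 0.0 on the assumed VR and 1.0 on the assumed RR. That is the pattern this sketch would flag as a strongly dependent style.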
Agent Features
Memory
Tool Use
Reproducibility
Risks & Boundaries
Limitations
Evaluation uses KRE and subsets; findings may not generalize beyond these tasks and domains.
Metrics focus on multiple-choice QA; outcomes may differ on free-form generation or other tasks.
When Not To Use
For non-knowledge-intensive tasks like casual dialog where prompt vs memory conflict is irrelevant.
As sole evidence for model behavior on open-ended generation or domain-specific specialty tasks.
Failure Modes
Instructions or hints reduce wrong answers but increase invalid or abstaining outputs.
Few-shot examples can confuse some models and lengthen context, harming robustness.
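To watch for the hint-related failure mode in your own runs, it may help to count correct, wrong, and invalid outputs separately instead of reporting accuracy alone. The sketch below assumes options are labelled A to D and treats any response that does not reduce to a single option letter as invalid. The parsing rule, the function names, and the sample responses are illustrative assumptions, not the paper's evaluation code or data.

```python
import re
from collections import Counter

# Accepts forms such as "B", "b.", "(B)", or "Answer: B"; anything else counts as invalid.
CHOICE_RE = re.compile(r"^\s*(?:answer\s*[:\-]?\s*)?\(?([A-D])\)?[.)]?\s*$", re.IGNORECASE)


def parse_choice(response):
    """Return the option letter in upper case, or None if the response is not a clean choice."""
    match = CHOICE_RE.match(response)
    return match.group(1).upper() if match else None


def tally(responses, golds):
    """Count correct, wrong, and invalid responses against the gold labels."""
    counts = Counter()
    for response, gold in zip(responses, golds):
        choice = parse_choice(response)
        if choice is None:
            counts["invalid"] += 1
        elif choice == gold:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Invented responses for four questions, with and without a hint prompt.
golds = ["A", "B", "C", "D"]
no_hint = ["A", "C", "A", "D"]
with_hint = ["A", "I am not sure.", "C", "Cannot tell from the context"]

baseline = tally(no_hint, golds)
hinted = tally(with_hint, golds)
print("no hint:  ", dict(baseline))
print("with hint:", dict(hinted))
```

In this made-up example the hint removes both wrong answers but produces two invalid ones, so accuracy alone would hide the shift.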

