Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
8
Why It Matters For Business
Keeping LLMs accurate protects user trust and reduces legal risk: use prompt/input edits for cheap, fast fixes, model editing for durable updates, and retrieval for up-to-date answers when the model shows low confidence.
Summary TLDR
This paper surveys two main ways to give LLMs fresh, accurate knowledge: knowledge editing (changing model behavior by editing weights or adding plug-ins) and retrieval augmentation (fetching external documents at inference). It organizes methods, catalogs benchmarks (editing: ZsRE/CounterFact; retrieval: NQ/HotPotQA/FEVER), and highlights gaps: most edits target single facts, retrieval needs robust judgement and conflict resolution, and multi-source/multimodal integration is underexplored. Practical takeaways: prefer prompt/input edits for cheap fixes, use model editing for persistent changes, and use retrieval when models are uncertain or entity popularity is low.
Problem Statement
Large language models hold a lot of knowledge in their weights but still get up-to-date facts and long-tail entities wrong, and they hallucinate. Two complementary fixes exist: knowledge editing (change model behavior or attach plug-ins) and retrieval augmentation (keep model weights fixed and fetch external text). The field is fragmented and lacks a unified taxonomy, comprehensive benchmarks, and practical guidance for conflict resolution.
Main Contribution
Systematic taxonomy of knowledge-integration methods: input editing, model editing, and post-edit assessment
Detailed review of retrieval augmentation: when to fetch, how to fetch, how to use docs, and how to handle conflicts
Catalog of benchmarks for both editing and retrieval, plus a short roadmap of open problems and applications
Key Findings
Most knowledge-editing evaluations focus on triple-fact QA benchmarks like ZsRE and CounterFact.
Retrieval-judgement methods cluster into simple calibration thresholds and model-based judgments, each with trade-offs.
Model editing methods range from single precise edits to bulk edits; MEMIT can apply thousands of edits at once.
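The simple calibration-threshold style of retrieval judgement can be sketched as a small gate. This is a minimal illustration, not a method from the survey: the class name RetrievalGate, the confidence and popularity inputs, and both default thresholds are assumptions that would need tuning per domain.

```python
from dataclasses import dataclass


@dataclass
class RetrievalGate:
    """Hypothetical threshold gate that decides whether to call a retriever."""

    # Both thresholds are illustrative assumptions, not values from the survey.
    min_confidence: float = 0.7  # below this, the model's own answer is not trusted
    min_popularity: int = 1000   # e.g. page views; below this, the entity is long-tail

    def should_retrieve(self, confidence: float, popularity: int) -> bool:
        # Retrieve when either signal suggests parametric memory is unreliable.
        return confidence < self.min_confidence or popularity < self.min_popularity


gate = RetrievalGate()
decisions = [
    gate.should_retrieve(confidence=0.9, popularity=50),    # long-tail entity
    gate.should_retrieve(confidence=0.9, popularity=5000),  # confident, popular
    gate.should_retrieve(confidence=0.3, popularity=5000),  # low confidence
]
```

A fixed threshold is cheap to run but depends on how well the confidence score is calibrated, which is the trade-off noted above against model-based judgement.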
Who Should Care
What To Try In 7 Days
Log cases where your LLM is low-confidence or wrong, and record the popularity of the entities involved
Add a retrieval step for low-popularity or low-confidence queries and measure accuracy lift
Prototype an input-editing prompt that prepends a short factual context and check impact on hallucination rates
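The input-editing prototype in the last step could look roughly like the sketch below. The function name build_prompt and the instruction wording are hypothetical; the facts would come from whatever trusted source you already maintain, and the resulting string would be passed to your own model client.

```python
def build_prompt(question: str, facts: list[str]) -> str:
    """Prepend a short factual context to the question (input editing).

    Hypothetical helper: the instruction text is an assumption, not a
    template taken from the survey.
    """
    if not facts:
        # Nothing to inject, so fall back to the plain question.
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Use the facts below; they take priority over what you remember.\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Who leads the example team?",
    ["As of this quarter, the example team is led by A. Placeholder."],
)
```

Running the same question set with and without the prepended facts gives a simple before/after comparison of hallucination rates.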
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Survey focuses on English and Wikipedia-style sources; less coverage of private or multimodal knowledge sources
- Many editing methods assume single-fact edits; real-world bulk or structured updates remain hard
- Conflict resolution between parametric memory and retrieved docs is mostly analyzed, not solved
When Not To Use
- When you need guaranteed, provable updates across all model outputs without ripple effects
- When deployment cannot support a retriever or external corpus
- When you require real-time private data and cannot expose it to external retrievers
Failure Modes
- Model ignores retrieved context and returns memorized (outdated) facts
- Edited facts cause unintended changes to unrelated model behavior (ripple effects)
- Retrieval returns noisy or adversarial passages and misleads the model
- Threshold-based retrieval decisions fail across domains due to calibration drift
Core Entities
Models
- ROME
- MEMIT
- MEND
- KE
- NKB
- SERAC
- T-Patcher
- GRACE
- PMET
- REPLUG
- REPLUG LSR
- GENRE
- DSI
Metrics
- EM
- F1
- Accuracy
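For reference, EM and F1 on short-answer QA are commonly computed as in the sketch below. The normalization (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention and is an assumption here; individual benchmarks may define it differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact answers, while F1 gives partial credit for overlapping tokens, so the two are usually reported together.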
Datasets
- ZsRE
- CounterFact
- CounterFact+
- Bi-ZsRE
- MQuAKE
- RippleEdits
- Eva-KELLM
- Natural Questions
- TriviaQA
- PopQA
- HotPotQA
- 2WikiMultiHopQA
- MuSiQue
- Bamboogle
- FEVER
- FEVERous
- FoolMeTwice
- StrategyQA
- CommonsenseQA
- INFOTABS
Benchmarks
- ZsRE
- CounterFact
- Bi-ZsRE
- MQuAKE
- Natural Questions
- HotPotQA
- FEVER
- StrategyQA

