Survey: How to add, update, and use external knowledge with large language models

Overview

Decision Snapshot: Needs Validation

This survey synthesizes many papers and benchmarks. Its evidence is descriptive rather than drawn from new experiments, so its practical recommendations are reliable in general but need empirical tuning for each deployment.

Citations: 8

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zhangyin Feng, Weitao Ma, Weijiang Yu, Lei Huang, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu

Why It Matters For Business

Keeping LLMs accurate protects user trust and reduces legal risk. Use prompt/input edits for cheap, fast fixes, model editing for durable updates, and retrieval for up-to-date answers when the model shows low confidence.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper surveys two main ways to give LLMs fresh, accurate knowledge: knowledge editing (changing model behavior by editing weights or adding plug-ins) and retrieval augmentation (fetching external documents at inference). It organizes methods, catalogs benchmarks (editing: ZsRE/CounterFact; retrieval: NQ/HotPotQA/FEVER), and highlights gaps: most edits target single facts, retrieval needs robust judgement and conflict resolution, and multi-source/multimodal integration is underexplored. Practical takeaways: prefer prompt/input edits for cheap fixes, use model editing for persistent changes, and use retrieval when models are uncertain or entity popularity is low.

Problem Statement

Large language models hold a lot of knowledge in their weights but still fail on up-to-date facts and long-tail entities, and they hallucinate. Two complementary fixes exist: knowledge editing (change model behavior or attach plug-ins) and retrieval augmentation (keep model weights fixed and fetch external text). The field is fragmented and lacks a unified taxonomy, comprehensive benchmarks, and practical guidance for conflict resolution.

Main Contribution

Systematic taxonomy of knowledge-integration methods: input editing, model editing, and post-edit assessment

Detailed review of retrieval augmentation: when to fetch, how to fetch, how to use docs, and how to handle conflicts

Key Findings

Most knowledge-editing evaluations focus on triple-fact QA benchmarks like ZsRE and CounterFact.

Numbers: ZsRE: 182,282; CounterFact: 21,919

Practical Use: If you need to edit model facts, start by testing on these QA-style benchmarks; expect methods to be tuned to single-fact, QA settings.

Evidence Ref: Section 2.3, Table 1

Retrieval-judgement methods cluster into simple calibration thresholds and model-based judgments, each with trade-offs.

Practical Use: Use confidence/popularity thresholds for quick wins; invest in model-based judgment if you need robust, dynamic retrieval decisions.

Evidence Ref: Section 3.1
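The threshold approach can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the survey: the `QuerySignal` fields, the function name, and both default thresholds are hypothetical and would need tuning on your own logs.

```python
from dataclasses import dataclass


@dataclass
class QuerySignal:
    """Signals known before retrieval (hypothetical fields, for illustration)."""
    confidence: float  # model's confidence in its draft answer, in [0, 1]
    popularity: float  # popularity score of the main entity, e.g. page views


def should_retrieve(signal: QuerySignal,
                    min_confidence: float = 0.7,
                    min_popularity: float = 1000.0) -> bool:
    """Retrieve when either signal falls below its (assumed) threshold."""
    return (signal.confidence < min_confidence
            or signal.popularity < min_popularity)


# A confident answer about a rare entity still triggers retrieval.
print(should_retrieve(QuerySignal(confidence=0.9, popularity=50.0)))
```

A model-based judgment would replace `should_retrieve` with a learned classifier or a model self-check, at the cost of extra inference.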

What To Try In 7 Days

Log cases where your LLM is low-confidence or wrong; mark entity popularity

Add a retrieval step for low-popularity or low-confidence queries and measure accuracy lift

Prototype an input-editing prompt that prepends a short factual context and check impact on hallucination rates
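For the input-editing prototype, one possible starting point is a small prompt builder that prepends a short factual context to the question. The function name, the prompt wording, and the `max_facts` cap are assumptions made for this sketch; the survey does not prescribe a template.

```python
def build_prompt(question: str, facts: list[str], max_facts: int = 3) -> str:
    """Prepend up to max_facts non-empty factual statements to the question."""
    selected = [fact.strip() for fact in facts if fact.strip()][:max_facts]
    if not selected:
        # No usable context: fall back to the bare question.
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"- {fact}" for fact in selected)
    return (
        "Use the facts below when they are relevant.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("Who leads the project?", ["Alice has led the project since May."]))
```

Running the same question set with and without the prepended facts gives a simple before/after comparison of hallucination rates.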

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey focuses on English and Wikipedia-style sources; less coverage of private or multimodal knowledge sources

Many editing methods assume single-fact edits; real-world bulk or structured updates remain hard

When Not To Use

When you need guaranteed, provable updates across all model outputs without ripple effects

When deployment cannot support a retriever or external corpus

Failure Modes

Model ignores retrieved context and returns memorized (outdated) facts

Edited facts cause unintended changes to unrelated model behavior (ripple effects)
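A rough way to watch for ripple effects is to compare model answers before and after an edit on prompts that should not change. The sketch below assumes a hypothetical `Model` interface (a callable from prompt to answer string) and uses plain string comparison; it is not the evaluation protocol of any specific benchmark.

```python
from typing import Callable, Iterable

Model = Callable[[str], str]  # hypothetical interface: prompt in, answer out


def edit_success(edited: Model, prompts: Iterable[str], target: str) -> float:
    """Fraction of prompts about the edited fact that now yield the target."""
    prompts = list(prompts)
    wanted = target.strip().lower()
    hits = sum(edited(p).strip().lower() == wanted for p in prompts)
    return hits / len(prompts) if prompts else 0.0


def unrelated_drift(original: Model, edited: Model,
                    prompts: Iterable[str]) -> float:
    """Fraction of unrelated prompts whose answer changed after the edit."""
    prompts = list(prompts)
    changed = sum(original(p) != edited(p) for p in prompts)
    return changed / len(prompts) if prompts else 0.0
```

A high `edit_success` with a non-trivial `unrelated_drift` suggests the edit leaked into behavior it was not meant to touch.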

Core Entities

Models

ROME, MEMIT, MEND, KE, NKB, SERAC, T-Patcher, GRACE, PMET, REPLUG, REPLUG LSR, GENRE, DSI

Metrics

EM, F1, Accuracy
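The summary lists EM (exact match) and F1 without defining them. The sketch below follows the common SQuAD-style convention for open-domain QA (lowercase, strip punctuation and English articles, then compare tokens); treating that convention as the one intended here is an assumption, and individual benchmarks may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging either function over a labeled query set, with and without retrieval, gives the accuracy lift mentioned under "What To Try In 7 Days".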

Datasets

ZsRE, CounterFact, CounterFact+, Bi-ZsRE, MQUAKE, RippleEdits, Eva-KELLM, Natural Questions, TriviaQA, PopQA, HotPotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle, FEVER, FEVERous, FoolMeTwice, StrategyQA, CommonsenseQA, INFOTABS

Benchmarks

ZsRE, CounterFact, Bi-ZsRE, MQUAKE, Natural Questions, HotPotQA, FEVER, StrategyQA
