Systematic comparison and new benchmarks for editing facts in LLMs

May 22, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper runs controlled comparisons across multiple editors and models and releases code and data, but results vary by dataset and cover only model sizes up to 20B, so practical adoption needs per-model pilots.

Citations: 12

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, Ningyu Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Model editing lets teams fix or update a deployed LLM quickly without expensive full retraining, but different editors trade off reliability, generalization, side effects, and ops cost.

Summary TLDR

This paper surveys methods for "model editing": changing an LLM's output for a targeted fact without breaking other behavior. It standardizes the task, re-implements many editors, runs controlled experiments on T5-XL, GPT-J, OPT-13B and GPT-NEOX-20B, and introduces a richer benchmark that measures portability (robust generalization), locality (side effects), and efficiency (time/memory). Key findings: some editors (e.g., SERAC, ROME) score very high on classic metrics but often fail portability checks or do not scale to some models; MEMIT supports massive batch edits but loses locality as the number of edits grows; in-context/memory methods (IKE) give better portability. Code and datasets are released.

Problem Statement

We lack standardized, practical ways to update or fix LLM knowledge without full retraining. Existing editing methods are scattered and evaluated inconsistently; they may not generalize, may create side effects, or may be impractical because of time and memory cost. This work defines the task formally and evaluates methods under uniform conditions.

Main Contribution

A clear task definition and three desiderata for edits: reliability (fix the target), generalization (apply to equivalent inputs), locality (avoid side effects).

A unified experimental study re-implementing many editors (SERAC, IKE, T-Patcher, CaliNET, KE, MEND, KN, ROME, MEMIT, FT-L) on larger models (T5-XL, GPT-J) and scaling tests on OPT-13B and GPT-NEOX-20B.
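The three desiderata above can be made concrete with a small scoring harness. This is a minimal sketch, not the paper's evaluation code: it assumes a model is any callable from prompt to answer string, uses case-insensitive exact match as the scoring rule, and measures locality as agreement between the edited and original model on out-of-scope prompts. The names `exact_match_rate`, `evaluate_edit`, and the toy models are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumption: a "model" is any callable mapping a prompt to an answer string.
Model = Callable[[str], str]


def exact_match_rate(model: Model, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs answered with a case-insensitive exact match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(model(p).strip().lower() == e.strip().lower() for p, e in pairs)
    return hits / len(pairs)


def evaluate_edit(
    edited: Model,
    original: Model,
    edit_pairs: List[Tuple[str, str]],
    paraphrase_pairs: List[Tuple[str, str]],
    unrelated_prompts: List[str],
) -> Dict[str, float]:
    """Score one edit on reliability, generalization, and locality."""
    # Reliability: the edited model gives the new answer on the edit prompt itself.
    reliability = exact_match_rate(edited, edit_pairs)
    # Generalization: the new answer also appears on equivalent (rephrased) prompts.
    generalization = exact_match_rate(edited, paraphrase_pairs)
    # Locality: answers on out-of-scope prompts are unchanged by the edit.
    unchanged = sum(edited(p) == original(p) for p in unrelated_prompts)
    locality = unchanged / len(unrelated_prompts) if unrelated_prompts else 0.0
    return {
        "reliability": reliability,
        "generalization": generalization,
        "locality": locality,
    }


# Toy stand-ins (hypothetical data): the "edit" patches one exact prompt only.
_base = {"Who leads Acme?": "Alice", "Capital of France?": "Paris"}
_patched = {**_base, "Who leads Acme?": "Bob"}


def original_model(prompt: str) -> str:
    return _base.get(prompt, "unknown")


def edited_model(prompt: str) -> str:
    return _patched.get(prompt, "unknown")


scores = evaluate_edit(
    edited_model,
    original_model,
    edit_pairs=[("Who leads Acme?", "Bob")],
    paraphrase_pairs=[("Who is the head of Acme?", "Bob")],
    unrelated_prompts=["Capital of France?"],
)
```

In this toy case the edit is reliable and local but does not generalize, because the patch only covers the exact edit prompt. Real evaluations would likely need a softer match than exact string equality.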

Key Findings

Memory-based and locate-then-edit methods can reach near-perfect scores on standard benchmarks but still fail to transfer edits reliably to related facts.

Numbers: SERAC reliability 99.89% on COUNTERFACT (T5-XL)

Practical Use: Don't treat high benchmark scores as proof of real-world correctness; test edited knowledge on related questions before deployment.

Evidence Ref: Table 1 (COUNTERFACT, SERAC)

MEMIT supports massive batch edits and stays effective up to very large edit counts in experiments.

Numbers: Remained robust with up to 1000 simultaneous edits (batch editing)

Practical Use: Use MEMIT for bulk knowledge updates when many facts change at once, but validate side effects on locality.

Evidence Ref: Section 4 Batch Editing; Figure 3
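The advice to test edited knowledge on related questions can be turned into a pre-deployment check. The sketch below groups probes by the three portability categories named in this summary (subject-replace, reversed-relation, one-hop) and reports accuracy per category. The `Probe` class, the function names, the toy facts, and the 0.8 pass threshold are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Probe:
    category: str  # e.g. "subject_replace", "reversed_relation", "one_hop"
    prompt: str
    expected: str


def portability_report(model: Callable[[str], str], probes: Iterable[Probe]) -> Dict[str, float]:
    """Per-category accuracy of the edited model on related questions."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for probe in probes:
        totals[probe.category] += 1
        if model(probe.prompt).strip().lower() == probe.expected.strip().lower():
            hits[probe.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


def passes_gate(report: Dict[str, float], min_score: float = 0.8) -> bool:
    """True only if every category meets the (assumed) minimum score."""
    return bool(report) and all(score >= min_score for score in report.values())


# Toy edited model (hypothetical): it knows the edit and two related forms,
# but cannot chain the edited fact into a one-hop question.
_answers = {
    "Who leads Acme Corp?": "Bob",
    "Bob is the CEO of which company?": "Acme",
}


def toy_edited_model(prompt: str) -> str:
    return _answers.get(prompt, "unknown")


probes = [
    Probe("subject_replace", "Who leads Acme Corp?", "Bob"),
    Probe("reversed_relation", "Bob is the CEO of which company?", "Acme"),
    Probe("one_hop", "In which city does the CEO of Acme live?", "Lyon"),
]
report = portability_report(toy_edited_model, probes)
ready = passes_gate(report)
```

Reporting per category rather than a single average makes a weak category (here, one-hop) visible instead of letting strong categories mask it.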

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
SERAC reliability (COUNTERFACT, T5-XL) | 99.89% | - | - | COUNTERFACT (T5-XL) | High reliability shown in Table 1 | Table 1
IKE portability (GPT-J) | Reversed-relation 92.96%, Subject-replace 88.77%, One-hop 55.38% | - | - | Portability (GPT-J) | Portability results in Table 3 | Table 3

What To Try In 7 Days

Run SERAC or an in-context memory method (IKE) on a small set of high-value fixes to validate portability before wide rollout.

If you need many simultaneous updates, test MEMIT on a non-production copy and evaluate locality on your downstream tasks.

Measure wall-clock edit time and VRAM for candidate editors with 10 edits to estimate operational cost.
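For the timing and memory measurement in the last step, a rough profiling wrapper might look like the sketch below. It assumes each editor can be wrapped in a hypothetical `apply_edit(edit)` callable. PyTorch is optional: when it is installed and a CUDA device is available, the sketch also reads peak tensor memory through `torch.cuda.max_memory_allocated`, which covers PyTorch allocations only and not total device usage.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, Iterable

try:
    import torch  # optional; only used to read peak GPU memory
except ImportError:
    torch = None


def profile_edits(apply_edit: Callable[[Any], None], edits: Iterable[Any]) -> Dict[str, Any]:
    """Time each call to apply_edit and, if CUDA is available, record peak tensor memory."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()

    durations = []
    for edit in edits:
        start = time.perf_counter()
        apply_edit(edit)
        if use_cuda:
            # Wait for queued GPU work so the timing covers the whole edit.
            torch.cuda.synchronize()
        durations.append(time.perf_counter() - start)

    peak_gb = torch.cuda.max_memory_allocated() / 1024**3 if use_cuda else None
    return {
        "n_edits": len(durations),
        "mean_seconds": mean(durations) if durations else None,
        "max_seconds": max(durations) if durations else None,
        "peak_vram_gb": peak_gb,
    }


# Usage with a placeholder editor (hypothetical): swap in a real editor's edit call.
applied = []
stats = profile_edits(applied.append, [f"edit-{i}" for i in range(10)])
```

Editors that need a one-off training phase (hypernetwork-based ones, for example) should have that phase timed separately, since per-edit time alone would understate their cost.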

Agent Features

Memory
explicit edit memory (SERAC, MemPrompt, MeLLo); in-context demonstrations (IKE)
Architectures
encoder-decoder; decoder-only

Optimization Features

Infra Optimization
matrix inversion sensitivity (some models non-invertible); VRAM and time trade-offs for hypernetwork training (MEND, SERAC)
Training Optimization
precompute covariance stats for ROME/MEMIT
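
As background on why the precomputed covariance and its invertibility matter, the rank-one update from the original ROME paper (not restated in this summary) can be written as below. Here W is the MLP projection weight being edited, k_* is the key vector for the edited subject, v_* is the target value vector, and K collects key vectors sampled from a text corpus. The notation follows the ROME paper as best recalled and should be checked against it.

```latex
\hat{W} = W + \Lambda \, (C^{-1} k_*)^{\top},
\qquad
\Lambda = \frac{v_* - W k_*}{(C^{-1} k_*)^{\top} k_*},
\qquad
C = K K^{\top}
```

Both the update direction and the scaling term depend on C^{-1}, so a singular or ill-conditioned C leaves the update undefined or numerically unstable. That is consistent with the matrix-inversion sensitivity listed above. C does not depend on the individual edit, which is why it can be computed once and reused.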

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments only up to 20B model size; behavior on newer architectures (e.g., LLaMA family) is untested.

Portability beyond one-hop questions and interactions among many simultaneous edits need more study.

When Not To Use

When you need provable global consistency across all knowledge without manual validation.

If your production model architecture differs from tested backends (matrix-inversion editors may break).

Failure Modes

Matrix non-invertibility breaks ROME/MEMIT on certain models, producing failed or poor edits.

Edited facts that pass local tests can fail portability checks (rephrased or reverse questions remain unedited).

Core Entities

Models

T5-XL, GPT-J, OPT-13B, GPT-NEOX-20B, ROME, MEMIT, SERAC, IKE, MEND, KE, T-Patcher, CaliNET, KN, FT-L, PMET, MeLLo, MemPrompt

Metrics

Accuracy; portability subject-replace %; portability reversed-relation %; portability one-hop %; locality other-attribute %; wall-clock per-edit time (s); VRAM usage (GB); training time (hrs)

Datasets

ZsRE; COUNTERFACT; Natural Questions (NQ); Portability (subject-replace, reversed-relation, one-hop) [constructed]

Benchmarks

Reliability (edit target); Generalization (equivalence neighborhood); Locality (side effects); Portability (robust generalization); Batch editing and Sequential editing; Efficiency (time, VRAM)

Context Entities

Models

BERT (prior small-model editing work); GPT-3 (context for in-context methods)

Datasets

Wikidata (aliases used for subject-replace)