Systematic comparison and new benchmarks for editing facts in LLMs

May 22, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

12

Authors

Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, Ningyu Zhang

Links

Abstract / PDF

Why It Matters For Business

Model editing lets teams fix or update a deployed LLM quickly without expensive full retraining, but different editors trade off reliability, generalization, side effects, and ops cost.

Summary TLDR

This paper surveys methods for "model editing": changing an LLM's output for a targeted fact without breaking other behavior. It standardizes the task, re-implements many editors, runs controlled experiments on T5-XL, GPT-J, OPT-13B and GPT-NEOX-20B, and introduces a richer benchmark that measures portability (robust generalization), locality (side effects), and efficiency (time/memory). Key findings: some editors (e.g., SERAC, ROME) score very high on classic metrics but often fail on portability or do not transfer to some models; MEMIT supports massive batch edits but loses locality as the number of edits grows; in-context/memory methods (IKE) give better portability. Code and datasets are released.

Problem Statement

We lack standardized, practical ways to update or fix LLM knowledge without full retraining. Existing editing methods are scattered and evaluated inconsistently; they may not generalize, may create side effects, or may be impractical because of time and memory costs. This work defines the task formally and evaluates methods under uniform conditions.

Main Contribution

A clear task definition and three desiderata for edits: reliability (fix the target), generalization (apply to equivalent inputs), locality (avoid side effects).

A unified experimental study that re-implements many editors (SERAC, IKE, T-Patcher, CaliNET, KE, MEND, KN, ROME, MEMIT, FT-L) on larger models (T5-XL, GPT-J), plus scaling tests on OPT-13B and GPT-NEOX-20B.

A new, broader evaluation suite (portability, locality, efficiency) and constructed datasets (subject-replace, reversed-relation, one-hop reasoning) to reveal real-world weaknesses.

Practical takeaways: which editors are fast, which support bulk edits, and where matrix-inversion methods fail across models.
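The three desiderata above (reliability, generalization, locality) can be made concrete by scoring an edited model against three probe sets. The sketch below is a minimal illustration, not the paper's evaluation code: `edit_scores`, the probe lists, and the lookup-table "models" are hypothetical, and it assumes exact-match scoring.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # hypothetical: maps a prompt to an answer string
Probe = Tuple[str, str]        # (prompt, expected answer)


def accuracy(model: Model, probes: List[Probe]) -> float:
    """Fraction of probes the model answers with an exact match."""
    if not probes:
        return 0.0
    return sum(model(prompt) == answer for prompt, answer in probes) / len(probes)


def edit_scores(
    edited: Model,
    original: Model,
    edit_probes: List[Probe],
    paraphrase_probes: List[Probe],
    unrelated_prompts: List[str],
) -> Dict[str, float]:
    # Locality: on out-of-scope prompts the edited model should still
    # give whatever the original model gave.
    locality_probes = [(p, original(p)) for p in unrelated_prompts]
    return {
        "reliability": accuracy(edited, edit_probes),
        "generalization": accuracy(edited, paraphrase_probes),
        "locality": accuracy(edited, locality_probes),
    }


# Toy stand-ins: the "edit" only changed the exact prompt, not its paraphrase.
base = {"Who leads Acme?": "Alice", "Acme's leader is": "Alice", "Capital of France?": "Paris"}
patched = {**base, "Who leads Acme?": "Bob"}


def original(prompt: str) -> str:
    return base.get(prompt, "")


def edited(prompt: str) -> str:
    return patched.get(prompt, "")


scores = edit_scores(
    edited,
    original,
    edit_probes=[("Who leads Acme?", "Bob")],
    paraphrase_probes=[("Acme's leader is", "Bob")],
    unrelated_prompts=["Capital of France?"],
)
print(scores)
```

In this toy case the edit is reliable and local but does not generalize, which is the kind of gap the separate metrics are meant to expose.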

Key Findings

Memory-based and locate-edit methods can reach near-perfect scores on standard benchmarks but still fail to transfer edits reliably to related facts.

Numbers: SERAC reliability 99.89% on COUNTERFACT (T5-XL)

MEMIT supports massive batch edits and stays effective up to very large edit counts in experiments.

Numbers: Tested robust up to 1000 simultaneous edits (batch editing)

Matrix-inversion-based editors can be brittle across model implementations and sizes.

Numbers: ROME reliability 22.23% on OPT-13B vs 99.34% on GPT-NEOX-20B

In-context and memory-augmented editing (IKE/SERAC style) give better portability to related queries in many cases.

Numbers: IKE on GPT-J: subject-replace 88.77%, reversed-relation 92.96%

Pre-training or preparing some editors is costly in time and memory, but per-edit latency can be low afterward.

Numbers: SERAC training >36 hrs (3×V100); MEND training >7 hrs; MEND VRAM >60 GB for training

Per-edit wall-clock times vary widely: some editors return an edited model in seconds while others take minutes to hours.

Numbers: 10-edit times on GPT-J: MEND 0.51s, SERAC 5.31s, ROME 147.2s, MEMIT 143.2s, T-Patcher 1864.74s

Results

SERAC reliability (COUNTERFACT, T5-XL)

Value: 99.89%

IKE portability (GPT-J)

Value: Reversed-relation 92.96%, Subject-replace 88.77%, One-hop 55.38%

ROME reliability (OPT-13B vs GPT-NEOX-20B)

Value: OPT-13B 22.23% vs GPT-NEOX-20B 99.34%

Bulk-edit scale

Value: MEMIT tested up to 1000 edits while maintaining performance

Wall-clock time for 10 edits (GPT-J)

Value: MEND 0.51s; SERAC 5.31s; ROME 147.2s; MEMIT 143.2s; T-Patcher 1864.74s

Who Should Care

What To Try In 7 Days

Run SERAC or an in-context memory method (IKE) on a small set of high-value fixes to validate portability before wide rollout.

If you need many simultaneous updates, test MEMIT on a non-production copy and evaluate locality on your downstream tasks.

Measure wall-clock edit time and VRAM for candidate editors with 10 edits to estimate operational cost.
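For the timing step, a small harness around whichever editor is being trialled is usually enough. The following is a sketch under assumptions: `apply_edit` is a hypothetical callable that applies one edit, and `fake_editor` is only a placeholder that sleeps.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def time_edits(apply_edit: Callable[[object], None], edits: Sequence[object]) -> Dict[str, float]:
    """Time each call to a hypothetical `apply_edit(edit)` and summarize."""
    per_edit: List[float] = []
    for edit in edits:
        start = time.perf_counter()
        apply_edit(edit)
        per_edit.append(time.perf_counter() - start)
    return {
        "n_edits": len(per_edit),
        "total_s": sum(per_edit),
        "mean_s": statistics.mean(per_edit) if per_edit else 0.0,
        "max_s": max(per_edit, default=0.0),
    }


def fake_editor(edit: object) -> None:
    # Placeholder: swap in a call to the editor under test.
    time.sleep(0.001)


report = time_edits(fake_editor, range(10))
print(report)
```

If the editor runs on a CUDA GPU under PyTorch, calling `torch.cuda.reset_peak_memory_stats()` before the loop and reading `torch.cuda.max_memory_allocated()` afterwards is one way to capture peak VRAM; it is left out here to keep the sketch dependency-free.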

Agent Features

Memory

  • explicit edit memory (SERAC, MemPrompt, MeLLo)
  • in-context demonstrations (IKE)
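Both memory styles listed above leave the base model's weights untouched and decide at query time whether a stored edit applies. The sketch below illustrates only that general idea: `EditMemory`, the stopword list, and the word-overlap rule are assumptions standing in for a learned scope classifier or retriever, and the prompt layout is not IKE's actual demonstration format.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Assumed stopword list, just enough for the toy example.
STOPWORDS = {"the", "is", "of", "a", "an", "who", "what"}


def content_words(text: str) -> Set[str]:
    """Lowercase, strip simple punctuation, and drop stopwords."""
    words = text.lower().replace("?", "").replace(".", "").split()
    return {w for w in words if w not in STOPWORDS}


@dataclass
class EditMemory:
    """Toy external store of edited facts."""
    facts: List[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def retrieve(self, query: str, min_overlap: int = 2) -> List[str]:
        # Assumption: content-word overlap approximates "is this query in scope?"
        q = content_words(query)
        return [f for f in self.facts if len(q & content_words(f)) >= min_overlap]

    def build_prompt(self, query: str) -> str:
        hits = self.retrieve(query)
        if not hits:
            return query  # out of scope: pass the query through unchanged
        context = "\n".join(f"New fact: {f}" for f in hits)
        return f"{context}\nQuestion: {query}\nAnswer:"


memory = EditMemory()
memory.add("The CEO of Acme is Bob.")
in_scope = memory.build_prompt("Who is the CEO of Acme?")
out_of_scope = memory.build_prompt("What is the capital of France?")
print(in_scope)
print(out_of_scope)
```

The pass-through branch is what protects locality in this design: queries that do not match a stored edit reach the model exactly as written.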

Architectures

  • encoder-decoder
  • decoder-only

Optimization Features

Infra Optimization

  • matrix inversion sensitivity (some models non-invertible)
  • VRAM and time trade-offs for hypernetwork training (MEND, SERAC)

Training Optimization

  • precompute covariance stats for ROME/MEMIT
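To show why the covariance statistics and their invertibility matter, here is a toy rank-one weight edit in the style of the closed-form update described for ROME: W' = W + Λ (C⁻¹k*)ᵀ with Λ = (v* − W k*) / ((C⁻¹k*)ᵀ k*). This is a pure-Python sketch on 2×2 matrices; the function names, the example vectors, and the singularity threshold are assumptions, and real implementations estimate C from many sampled activations.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def key_covariance(keys: List[Vector]) -> Matrix:
    """Uncentered second moment C = (1/n) * sum_i k_i k_i^T over sampled keys."""
    n, d = len(keys), len(keys[0])
    return [[sum(k[i] * k[j] for k in keys) / n for j in range(d)] for i in range(d)]


def solve_2x2(c: Matrix, b: Vector) -> Vector:
    """Solve C x = b for a 2x2 C; raises if C is (near-)singular."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    if abs(det) < 1e-12:  # assumed threshold for this toy example
        raise ValueError("covariance is not invertible")
    return [
        (c[1][1] * b[0] - c[0][1] * b[1]) / det,
        (c[0][0] * b[1] - c[1][0] * b[0]) / det,
    ]


def rank_one_edit(w: Matrix, c: Matrix, k_star: Vector, v_star: Vector) -> Matrix:
    """Return W + Lambda u^T with u = C^{-1} k*, so the new W maps k* to v*."""
    u = solve_2x2(c, k_star)
    denom = sum(ui * ki for ui, ki in zip(u, k_star))
    residual = [v - wk for v, wk in zip(v_star, matvec(w, k_star))]
    return [
        [w[i][j] + residual[i] * u[j] / denom for j in range(len(u))]
        for i in range(len(w))
    ]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy "precomputed" key samples
c = key_covariance(keys)
w = [[1.0, 0.0], [0.0, 1.0]]
k_star, v_star = [1.0, 2.0], [3.0, -1.0]
w_new = rank_one_edit(w, c, k_star, v_star)
print(matvec(w_new, k_star))  # equals v_star up to floating-point rounding
```

If the sampled keys are collinear (for example `[[1.0, 1.0], [2.0, 2.0]]`), `solve_2x2` raises, which mirrors the non-invertibility sensitivity noted under Infra Optimization.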

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments only up to 20B model size; behavior on newer architectures (e.g., LLaMA family) is untested.
  • Portability beyond one-hop reasoning, and interactions among many simultaneous edits, need more study.
  • Some editors require expensive pretraining or heavy VRAM, limiting direct applicability for small teams.

When Not To Use

  • When you need provable global consistency across all knowledge without manual validation.
  • If your production model architecture differs from tested backends (matrix-inversion editors may break).
  • When you lack GPU resources to train hypernetworks or collect covariance statistics.

Failure Modes

  • Matrix non-invertibility breaks ROME/MEMIT on certain models, producing failed or poor edits.
  • Edited facts that pass local tests can fail portability checks (rephrased or reverse questions remain unedited).
  • Sequential or many edits can drift model behavior, lowering reliability and increasing side effects.

Core Entities

Models

  • T5-XL
  • GPT-J
  • OPT-13B
  • GPT-NEOX-20B

Editing Methods

  • ROME
  • MEMIT
  • SERAC
  • IKE
  • MEND
  • KE
  • T-Patcher
  • CaliNET
  • KN
  • FT-L
  • PMET
  • MeLLo
  • MemPrompt

Metrics

  • Accuracy
  • portability subject-replace %
  • portability reversed-relation %
  • portability one-hop %
  • locality other-attribute %
  • wall-clock per-edit time (s)
  • VRAM usage (GB)
  • training time (hrs)

Datasets

  • ZsRE
  • COUNTERFACT
  • Natural Questions (NQ)
  • Portability (subject-replace, reversed-relation, one-hop) [constructed]
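The three portability probe types can be pictured as templates derived from a single edit. The sketch below is illustrative only: the `Edit` record, the prompt templates, and the example names are hypothetical, and it does not reproduce how the released datasets were actually constructed.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Edit:
    subject: str
    relation: str
    new_object: str


def portability_probes(
    edit: Edit,
    alias: str,                     # another name for the subject
    inverse_relation: str,          # relation that points back from object to subject
    object_fact: Tuple[str, str],   # (attribute, value) known about the new object
) -> Dict[str, Tuple[str, str]]:
    """Build (prompt, expected answer) pairs for the three probe types."""
    attribute, value = object_fact
    return {
        # Same fact, subject swapped for an alias.
        "subject_replace": (f"The {edit.relation} of {alias} is", edit.new_object),
        # Query from the object's side; the answer is the original subject.
        "reversed_relation": (f"The {inverse_relation} of {edit.new_object} is", edit.subject),
        # Requires chaining the edited fact with one further fact.
        "one_hop": (
            f"The {attribute} of the {edit.relation} of {edit.subject} is",
            value,
        ),
    }


probes = portability_probes(
    Edit(subject="Alice", relation="spouse", new_object="Bob"),
    alias="Alice A. Example",
    inverse_relation="spouse",
    object_fact=("birthplace", "Springfield"),
)
for name, (prompt, answer) in probes.items():
    print(name, "|", prompt, "->", answer)
```

An editor that only memorizes the exact edit prompt would answer none of these correctly, which is what the portability scores are designed to reveal.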

Benchmarks

  • Reliability (edit target)
  • Generalization (equivalence neighborhood)
  • Locality (side effects)
  • Portability (robust generalization)
  • Batch editing and Sequential editing
  • Efficiency (time, VRAM)

Context Entities

Models

  • BERT (prior small-model editing work)
  • GPT-3 (context for in-context methods)

Datasets

  • Wikidata (aliases used for subject-replace)