Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
12
Why It Matters For Business
Model editing lets teams fix or update a deployed LLM quickly without expensive full retraining, but different editors trade off reliability, generalization, side effects, and ops cost.
Summary TLDR
This paper surveys methods for "model editing": changing an LLM's output for a targeted fact without breaking other behavior. It standardizes the task, re-implements many editors, runs controlled experiments on T5-XL, GPT-J, OPT-13B and GPT-NEOX-20B, and introduces a richer benchmark that measures portability (robust generalization), locality (side effects), and efficiency (time/memory). Key findings: some editors (e.g., SERAC, ROME) score very high on classic metrics but often fail portability tests or do not scale to some models; MEMIT supports massive batch edits but loses locality as the number of edits grows; in-context/memory methods (IKE) give better portability. Code and datasets are released.
Problem Statement
We lack standardized, practical ways to update or fix LLM knowledge without full retraining. Existing editing methods are scattered and evaluated inconsistently; they may not generalize, may create side effects, or may be impractical because of time and memory costs. This work defines the task formally and evaluates methods under uniform conditions.
Main Contribution
A clear task definition and three desiderata for edits: reliability (fix the target), generalization (apply to equivalent inputs), locality (avoid side effects).
A unified experimental study re-implementing many editors (SERAC, IKE, T-Patcher, CaliNET, KE, MEND, KN, ROME, MEMIT, FT-L) on larger models (T5-XL, GPT-J) and scaling tests on OPT-13B and GPT-NEOX-20B.
A new, broader evaluation suite (portability, locality, efficiency) and constructed datasets (subject-replace, reversed-relation, one-hop reasoning) to reveal real-world weaknesses.
Practical takeaways: which editors are fast, which support bulk edits, and where matrix-inversion methods fail across models.
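The three desiderata above can be made concrete with a small scoring sketch. This is an illustration under stated assumptions, not the paper's evaluation code: `EditCase`, `score_edits`, and the exact-match comparison are hypothetical names and simplifications, and the "model" is any callable that maps a prompt to an answer string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EditCase:
    """One edit plus the probes used to score it (hypothetical structure)."""
    prompt: str                      # the prompt that was edited
    target: str                      # the answer the edit should produce
    paraphrases: List[str] = field(default_factory=list)     # in-scope rephrasings
    unrelated: Dict[str, str] = field(default_factory=dict)  # out-of-scope prompt -> pre-edit answer


def _mean(values) -> float:
    values = list(values)
    return sum(values) / len(values) if values else float("nan")


def score_edits(model: Callable[[str], str], cases: List[EditCase]) -> Dict[str, float]:
    """Exact-match versions of reliability, generalization, and locality."""
    return {
        # Reliability: the edited prompt itself returns the new target.
        "reliability": _mean(model(c.prompt) == c.target for c in cases),
        # Generalization: equivalent prompts also return the new target.
        "generalization": _mean(
            model(p) == c.target for c in cases for p in c.paraphrases
        ),
        # Locality: unrelated prompts keep their pre-edit answers.
        "locality": _mean(
            model(q) == old for c in cases for q, old in c.unrelated.items()
        ),
    }


# Toy stand-in for an edited model: a lookup table with a fallback answer.
toy_answers = {
    "Capital of Freedonia?": "Sylvania City",
    "What is Freedonia's capital?": "Sylvania City",
    "Capital of France?": "Paris",
}
toy_model = lambda prompt: toy_answers.get(prompt, "unknown")

cases = [
    EditCase(
        prompt="Capital of Freedonia?",
        target="Sylvania City",
        paraphrases=["What is Freedonia's capital?", "Freedonia's capital city is?"],
        unrelated={"Capital of France?": "Paris"},
    )
]
scores = score_edits(toy_model, cases)
print(scores)
```

In this toy run the edit is reliable and local but only half of the paraphrases pick it up, which is the kind of gap the portability probes are meant to expose.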
Key Findings
Memory-based and locate-then-edit methods can reach near-perfect scores on standard benchmarks but still fail to transfer edits reliably to related facts.
MEMIT supports massive batch edits and stays effective up to very large edit counts in experiments.
Matrix-inversion-based editors can be brittle across model implementations and sizes.
In-context and memory-augmented editing (IKE/SERAC style) give better portability to related queries in many cases.
Pre-training or preparing some editors is costly in time and memory, but per-edit latency can be low afterward.
Per-edit wall-clock times vary widely: some editors return an edited model in seconds while others take minutes to hours.
Results
SERAC reliability (COUNTERFACT, T5-XL)
IKE portability (GPT-J)
ROME reliability (OPT-13B vs GPT-NEOX-20B)
Bulk-edit scale
Wall-clock time for 10 edits (GPT-J)
Who Should Care
What To Try In 7 Days
Run SERAC or an in-context memory method (IKE) on a small set of high-value fixes to validate portability before wide rollout.
If you need many simultaneous updates, test MEMIT on a non-production copy and evaluate locality on your downstream tasks.
Measure wall-clock edit time and VRAM for candidate editors with 10 edits to estimate operational cost.
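For the timing measurement suggested above, a minimal harness could look like the following. It is a sketch that assumes each candidate editor can be wrapped in a single `apply_edit(edit)` callable; `time_edits` and `apply_edit` are hypothetical names, not part of any editor's API. VRAM is not measured here: if the editor runs on PyTorch with CUDA, one option is to call `torch.cuda.reset_peak_memory_stats()` before the loop and read `torch.cuda.max_memory_allocated()` afterward.

```python
import statistics
import time
from typing import Any, Callable, Dict, Iterable, List


def time_edits(apply_edit: Callable[[Any], None], edits: Iterable[Any]) -> Dict[str, float]:
    """Apply each edit once and record wall-clock seconds per edit."""
    durations: List[float] = []
    for edit in edits:
        start = time.perf_counter()
        apply_edit(edit)  # hypothetical hook: wraps whichever editor is under test
        durations.append(time.perf_counter() - start)
    if not durations:
        return {"n_edits": 0, "total_s": 0.0, "mean_s": 0.0, "max_s": 0.0}
    return {
        "n_edits": len(durations),
        "total_s": sum(durations),
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }


# Usage with a no-op stand-in editor and 10 placeholder edits.
applied: List[Any] = []
report = time_edits(applied.append, range(10))
print(report)
```

Reporting the maximum alongside the mean is a deliberate choice: a single slow edit can matter more for an on-call fix than the average does.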
Agent Features
Memory
- explicit edit memory (SERAC, MemPrompt, MeLLo)
- in-context demonstrations (IKE)
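The explicit-memory idea can be illustrated with a toy router: edits live in a side store, a scope check decides whether a query falls under a stored edit, and everything else goes to the unmodified base model. This is only a sketch of the general pattern. The class name `EditMemory`, the string-similarity scope check, and the threshold are assumptions made for illustration; SERAC itself uses trained components for these roles.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


class EditMemory:
    """Toy memory-based editor: base weights are never modified."""

    def __init__(self, base_model: Callable[[str], str], threshold: float = 0.8):
        self.base_model = base_model
        self.threshold = threshold  # assumed cutoff for "in scope"
        self.edits: List[Tuple[str, str]] = []

    def add_edit(self, prompt: str, target: str) -> None:
        self.edits.append((prompt, target))

    def _scope_score(self, query: str, prompt: str) -> float:
        # Stand-in for a learned scope classifier: plain string similarity.
        return SequenceMatcher(None, query.lower(), prompt.lower()).ratio()

    def __call__(self, query: str) -> str:
        best: Optional[Tuple[str, str]] = max(
            self.edits, key=lambda e: self._scope_score(query, e[0]), default=None
        )
        if best is not None and self._scope_score(query, best[0]) >= self.threshold:
            return best[1]             # in scope: answer from the edit memory
        return self.base_model(query)  # out of scope: defer to the base model


# Usage with a constant stand-in for the base model.
edited = EditMemory(base_model=lambda q: "base answer")
edited.add_edit("Who leads Acme Corp?", "Dana")
print(edited("Who leads Acme Corp?"))
print(edited("What is 2 + 2?"))
```

Because the edits sit outside the model, removing one is a list operation rather than a weight change. The quality of the scope check then determines both generalization and locality.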
Architectures
- encoder-decoder
- decoder-only
Optimization Features
Infra Optimization
- matrix inversion sensitivity (the required matrix is non-invertible on some models)
- VRAM and time trade-offs for hypernetwork training (MEND, SERAC)
Training Optimization
- precompute covariance stats for ROME/MEMIT
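A pre-flight check for the inversion issue can be sketched in plain Python. The sketch assumes the statistic of interest is an uncentered covariance of key vectors, C = (1/n) Σ k_i k_iᵀ, which is how ROME describes its cached statistic; the helper names and the tiny hand-made vectors are hypothetical, and a real check would run on activations collected from the target model with a numerical library.

```python
from typing import List

Matrix = List[List[float]]


def uncentered_covariance(keys: Matrix) -> Matrix:
    """C = (1/n) * sum_i k_i k_i^T, with each row of `keys` one key vector."""
    n, d = len(keys), len(keys[0])
    return [[sum(k[a] * k[b] for k in keys) / n for b in range(d)] for a in range(d)]


def rank(matrix: Matrix, tol: float = 1e-10) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= factor * m[r][j]
        r += 1
    return r


def is_invertible(matrix: Matrix) -> bool:
    return rank(matrix) == len(matrix)


# Two samples cannot span three dimensions, so C is singular.
too_few = uncentered_covariance([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# Three independent samples give a full-rank C.
enough = uncentered_covariance([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(is_invertible(too_few), is_invertible(enough))
```

Running a check like this before attempting an edit turns a silent poor edit into an explicit early failure, which is easier to handle operationally.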
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments only up to 20B model size; behavior on newer architectures (e.g., LLaMA family) is untested.
- Portability beyond one-hop reasoning and interactions among many simultaneous edits need more study.
- Some editors require expensive pretraining or heavy VRAM, limiting direct applicability for small teams.
When Not To Use
- When you need provable global consistency across all knowledge without manual validation.
- If your production model architecture differs from tested backends (matrix-inversion editors may break).
- When you lack GPU resources to train hypernetworks or collect covariance statistics.
Failure Modes
- Matrix non-invertibility breaks ROME/MEMIT on certain models, producing failed or poor edits.
- Edited facts that pass local tests can fail portability checks (rephrased or reverse questions remain unedited).
- Sequential or many edits can drift model behavior, lowering reliability and increasing side effects.
Core Entities
Models
- T5-XL
- GPT-J
- OPT-13B
- GPT-NEOX-20B
- ROME
- MEMIT
- SERAC
- IKE
- MEND
- KE
- T-Patcher
- CaliNET
- KN
- FT-L
- PMET
- MeLLo
- MemPrompt
Metrics
- Accuracy
- portability subject-replace %
- portability reversed-relation %
- portability one-hop %
- locality other-attribute %
- wall-clock per-edit time (s)
- VRAM usage (GB)
- training time (hrs)
Datasets
- ZsRE
- COUNTERFACT
- Natural Questions (NQ)
- Portability (subject-replace, reversed-relation, one-hop) [constructed]
Benchmarks
- Reliability (edit target)
- Generalization (equivalence neighborhood)
- Locality (side effects)
- Portability (robust generalization)
- Batch editing and Sequential editing
- Efficiency (time, VRAM)
Context Entities
Models
- BERT (prior small-model editing work)
- GPT-3 (context for in-context methods)
Datasets
- Wikidata (aliases used for subject-replace)

