Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
RAE lets you update LLM answers on multi-step questions quickly and cheaply by changing context instead of model weights, reducing API cost and avoiding retraining.
Summary TLDR
The paper presents RAE, a retrieval-driven model-editing pipeline for multi-hop question answering. RAE stores edited facts in a knowledge graph and retrieves a subgraph by maximizing the mutual information (MI) between a candidate fact chain and the input question, estimated with an auxiliary next-token LLM. It then prunes redundant facts using the target LLM's output entropy before in-context editing. On the MQuAKE multi-hop editing benchmarks, across several open models of 7B parameters or fewer, RAE substantially improves edited-answer accuracy over embedding- and probability-based baselines. The code is public.
Problem Statement
LLMs struggle to integrate updated facts into multi-hop answers because conventional retrieval (embedding similarity or naive question decomposition) misses relevant facts whose entities differ from the query. This leads to wrong or outdated multi-step reasoning. The goal is to reliably fetch the correct edited fact chain for a question so in-context editing updates the model's answer without weight updates.
Main Contribution
RAE framework: combines a knowledge graph of edited facts with mutual-information-driven retrieval and entropy-based pruning, then applies in-context editing.
A practical MI decomposition that uses next-token probabilities from an auxiliary autoregressive LLM to score candidate relations along fact chains.
A pruning rule that uses the edited model's output entropy to remove redundant or irrelevant retrieved facts.
Extensive experiments on MQuAKE-CF, MQuAKE-T, and Popular datasets across multiple open and proprietary LLMs, plus theoretical justification linking MI maximization to in-context learning activation.
Public code release: https://github.com/sycny/RAE.
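The relation-scoring idea above can be illustrated with a small sketch. This is not the paper's exact MI decomposition: it uses a pointwise-MI-style score (how much the question raises the probability of a relation), and the `log_prob` callable, `TOY_PROBS` table, and example question are hypothetical stand-ins for next-token probabilities from an auxiliary autoregressive LLM.

```python
import math
from typing import Callable, Dict, List, Tuple

# Assumed interface: log_prob(context, continuation) -> log-probability of the
# continuation given the context, e.g. from an auxiliary LM's next-token scores.
LogProb = Callable[[str, str], float]


def pmi_score(log_prob: LogProb, question: str, head: str, relation: str) -> float:
    """Pointwise-MI-style score: log p(rel | question, head) - log p(rel | head)."""
    conditioned = log_prob(f"{question} {head}", relation)
    unconditioned = log_prob(head, relation)
    return conditioned - unconditioned


def rank_relations(
    log_prob: LogProb, question: str, head: str, relations: List[str]
) -> List[Tuple[str, float]]:
    """Return (relation, score) pairs sorted from most to least informative."""
    scored = [(r, pmi_score(log_prob, question, head, r)) for r in relations]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy probability table (made-up numbers, for illustration only).
QUESTION = "In which city is the Eiffel Tower located?"
HEAD = "Eiffel Tower"
TOY_PROBS: Dict[Tuple[str, str], float] = {
    (f"{QUESTION} {HEAD}", "located in"): 0.6,
    (HEAD, "located in"): 0.2,
    (f"{QUESTION} {HEAD}", "designed by"): 0.1,
    (HEAD, "designed by"): 0.3,
}


def toy_log_prob(context: str, continuation: str) -> float:
    return math.log(TOY_PROBS[(context, continuation)])


ranking = rank_relations(toy_log_prob, QUESTION, HEAD, ["designed by", "located in"])
print(ranking[0][0])  # relation whose probability the question raises the most
```

Subtracting the question-free log-probability is one way to counter the bias toward relations that are frequent regardless of the question; a real system would replace `toy_log_prob` with scores from the auxiliary model.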
Key Findings
MI-based retrieval substantially raises multi-hop retrieval precision compared to embedding or probability baselines.
Pruning retrieved facts by measuring the target LLM's prediction entropy reduces noise and raises edited accuracy.
RAE yields much higher edited-answer accuracy than prior editing methods on multi-hop tasks for smaller models.
RAE edits proprietary models effectively and cheaply using a small open LLM for retrieval.
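The entropy-based pruning finding can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it keeps the prefix of the retrieved fact chain that minimizes the Shannon entropy of the target model's answer distribution. The `answer_dist` callable and the toy distributions are hypothetical.

```python
import math
from typing import Callable, List, Sequence

# Assumed interface: answer_dist(question, facts) -> probabilities the target
# LLM assigns to its candidate answers when prompted with those facts.
AnswerDist = Callable[[str, Sequence[str]], Sequence[float]]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def prune_by_entropy(
    question: str, facts: Sequence[str], answer_dist: AnswerDist
) -> List[str]:
    """Keep the shortest chain prefix with the lowest output entropy."""
    best_k, best_h = 0, float("inf")
    for k in range(1, len(facts) + 1):
        h = shannon_entropy(answer_dist(question, facts[:k]))
        if h < best_h:  # strict '<' prefers the shorter prefix on ties
            best_k, best_h = k, h
    return list(facts[:best_k])


# Toy distributions keyed by how many facts are in the prompt (made-up numbers).
TOY_DISTS = {1: [0.5, 0.5], 2: [0.9, 0.1], 3: [0.6, 0.4]}


def toy_answer_dist(question: str, facts: Sequence[str]) -> Sequence[float]:
    return TOY_DISTS[len(facts)]


facts = ["fact A", "fact B", "fact C (redundant)"]
kept = prune_by_entropy("toy question", facts, toy_answer_dist)
print(kept)
```

In this toy case the third fact raises the entropy again, so only the first two are kept. The same rule inherits the calibration risk noted under the limitations: a confidently wrong model also produces low entropy.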
Results
Accuracy
Retrieval precision P@1 (2-hop, MQuAKE-CF)
Who Should Care
What To Try In 7 Days
Store edited facts in a small knowledge graph (Wikidata-style) rather than only embeddings.
Implement MI-scoring using a small autoregressive model (GPT-2/GPT-J) to rank KG paths for a target question.
Add an entropy-based filter: keep facts that lower the target LLM's output entropy before sending prompts to production APIs.
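A first prototype of these steps might look like the sketch below. It is an assumption-laden illustration rather than the released code: `EditedKG`, `retrieve_chain`, `build_prompt`, and the keyword-based `toy_score` are hypothetical names, the greedy walk is a simplification of subgraph retrieval, and the edited fact is a made-up counterfactual.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


class EditedKG:
    """Tiny in-memory store of triples; an edit overwrites the old tail."""

    def __init__(self) -> None:
        self._out: Dict[str, Dict[str, str]] = defaultdict(dict)

    def upsert(self, head: str, relation: str, tail: str) -> None:
        self._out[head][relation] = tail

    def edges(self, head: str) -> List[Triple]:
        return [(head, r, t) for r, t in self._out.get(head, {}).items()]


def retrieve_chain(
    kg: EditedKG, start: str, score: Callable[[Triple], float], max_hops: int = 2
) -> List[Triple]:
    """Greedy walk: at each hop follow the highest-scoring outgoing edge."""
    chain: List[Triple] = []
    current = start
    for _ in range(max_hops):
        candidates = kg.edges(current)
        if not candidates:
            break
        best = max(candidates, key=score)
        chain.append(best)
        current = best[2]
    return chain


def build_prompt(question: str, chain: List[Triple]) -> str:
    """Place the retrieved facts in context ahead of the question."""
    facts = "\n".join(f"- {h} {r} {t}." for h, r, t in chain)
    return f"Use these updated facts:\n{facts}\nQuestion: {question}\nAnswer:"


kg = EditedKG()
kg.upsert("Eiffel Tower", "is located in", "Rome")  # counterfactual edit
kg.upsert("Eiffel Tower", "was designed by", "Gustave Eiffel")
kg.upsert("Rome", "is the capital of", "Italy")

# Placeholder scorer; an MI-based score from an auxiliary LM would go here.
RELEVANT = {"is located in", "is the capital of"}


def toy_score(triple: Triple) -> float:
    return 1.0 if triple[1] in RELEVANT else 0.0


question = "The Eiffel Tower is in the capital of which country?"
chain = retrieve_chain(kg, "Eiffel Tower", toy_score)
prompt = build_prompt(question, chain)
print(prompt)
```

Keying the store by head entity is what lets the second hop start from "Rome", an entity that never appears in the question, which is the case embedding similarity tends to miss.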
Reproducibility
Code Urls
Data Urls
- https://github.com/sycny/RAE (implementation and data prep scripts)
- Wikidata (external KG)
- MQuAKE datasets (from prior work referenced)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires a KG that links edited facts; performance depends on KG coverage and quality.
- MI scoring relies on the auxiliary autoregressive LLM's reasoning; weak retrievers limit results.
- Entropy pruning assumes that lower output entropy indicates correct fact coverage, which may fail if the model is overconfident on wrong facts.
When Not To Use
- When you lack a connected fact graph or cannot build one from your edits.
- When edits are free-form textual changes not expressible as triplets (head, relation, tail).
- When actual changes to model parameters are required (e.g., a downstream model's internals must reflect the edit).
Failure Modes
- Auxiliary LLM biases: MI may favor frequent relations and miss rare but correct paths.
- Over-pruning: entropy filter could drop needed facts if the target LLM is miscalibrated.
- KG mismatch: edits not injected into the external KG lead to broken chains and low recall.
Core Entities
Models
- GPT-2 (1.5B)
- GPT-J (6B)
- Falcon (7B)
- Vicuna (7B)
- Llama2-chat (7B)
- GPT-3.5
- GPT-4
- gpt-babbage-002
- GPT-3.5-turbo-0613
Metrics
- Accuracy
- Precision@K (P@K)
- Output entropy (Shannon entropy)
- Editing API cost
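For reference, Precision@K can be computed as in the sketch below. It uses the standard definition (the fraction of the top K retrieved items that are relevant); the paper's exact evaluation protocol for multi-hop chains may differ, and the function name and fact identifiers are illustrative.

```python
from typing import Hashable, Sequence, Set


def precision_at_k(retrieved: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k retrieved items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Toy ranking of fact identifiers (made-up data).
retrieved_facts = ["f1", "f7", "f3"]
gold_facts = {"f1", "f3"}

p_at_1 = precision_at_k(retrieved_facts, gold_facts, 1)
p_at_2 = precision_at_k(retrieved_facts, gold_facts, 2)
print(p_at_1, p_at_2)
```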
Datasets
- MQuAKE-CF
- MQuAKE-T
- MQuAKE-CF-9k
- Popular dataset (from prior work)
Benchmarks
- MQuAKE multi-hop editing benchmark (MQuAKE-CF / MQuAKE-T)

