Overview
The idea is simple and implementable: use a small LLM to rank knowledge-graph (KG) paths by mutual information (MI), prune them by the target LLM's output entropy, then prompt the target LLM with the surviving facts. Experiments across models and APIs support its effectiveness.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
RAE lets you update LLM answers on multi-step questions quickly and cheaply by changing context instead of model weights, reducing API cost and avoiding retraining.
Summary TLDR
The paper presents RAE, a retrieval-driven model-editing pipeline for multi-hop question answering. RAE stores edited facts in a knowledge graph, retrieves the subgraph (fact chain) that maximizes mutual information (MI) with the input question, scoring candidates with an auxiliary next-token LLM, and then prunes redundant facts using the target LLM's output entropy before in-context editing. On multi-hop editing benchmarks (MQuAKE variants) and several models of 7B parameters or fewer, RAE substantially improves edited-answer accuracy over embedding- and probability-based baselines. In the reported results, RAE reaches 62.8% accuracy on MQuAKE-CF with GPT-2, against 21.9% for Subgraph Retriever. The code is public.
Problem Statement
LLMs struggle to integrate updated facts into multi-hop answers because conventional retrieval (embedding similarity or naive question decomposition) misses relevant facts whose entities differ from the query. This leads to wrong or outdated multi-step reasoning. The goal is to reliably fetch the correct edited fact chain for a question so in-context editing updates the model's answer without weight updates.
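In-context editing means the edited facts travel in the prompt rather than in the weights. A minimal sketch of that last step, assuming a hypothetical prompt template and made-up toy facts (neither comes from the paper):

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def build_edit_prompt(question: str, facts: List[Triple]) -> str:
    """Prepend edited facts to the question so the model reads them as context.

    The template below is an illustrative assumption, not the paper's prompt.
    """
    lines = ["Use the following updated facts when answering."]
    for head, relation, tail in facts:
        lines.append(f"- {head} | {relation} | {tail}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Toy, fictional facts used only to show the shape of the prompt.
toy_facts = [
    ("Ada", "employer", "Globex"),
    ("Globex", "headquartered in", "Shelbyville"),
]
example_prompt = build_edit_prompt(
    "In which city is Ada's employer headquartered?", toy_facts
)
print(example_prompt)
```

Changing an answer then only requires changing `toy_facts`; the model itself is untouched.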
Main Contribution
RAE framework: combines a knowledge graph of edited facts with mutual-information-driven retrieval and entropy-based pruning, then applies in-context editing.
A practical MI decomposition that uses next-token probabilities from an auxiliary autoregressive LLM to score candidate relations along fact chains.
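The decomposition can be illustrated with a short sketch. It assumes that ranking chains by MI can be approximated by ranking them by the conditional log-probability of the chain given the question, split hop by hop with the chain rule; that reading is an assumption of this sketch, not a formula quoted from the paper. `LogProbFn`, `chain_log_prob`, `rank_chains`, the prompt layout, and the toy probability table are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical interface: (prompt, relation) -> log p(relation | prompt).
# In practice this would wrap next-token probabilities from a small LM.
LogProbFn = Callable[[str, str], float]


def chain_log_prob(question: str, chain: List[Triple], log_prob: LogProbFn) -> float:
    """Sum log p(r_i | question, earlier facts, h_i) over the hops of one chain."""
    context = f"Question: {question}\n"
    total = 0.0
    for head, relation, tail in chain:
        prompt = context + f"Entity: {head}\nRelation:"
        total += log_prob(prompt, relation)
        # Later hops are conditioned on the facts already chosen.
        context += f"Fact: {head} {relation} {tail}\n"
    return total


def rank_chains(question: str, chains: List[List[Triple]], log_prob: LogProbFn):
    """Return (score, chain) pairs, highest score first."""
    scored = [(chain_log_prob(question, c, log_prob), c) for c in chains]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Stand-in scorer: a fixed table that ignores the prompt. A real scorer
# would condition on the prompt; these numbers are made up for illustration.
TOY_PROBS = {"employer": 0.7, "born in": 0.1, "headquartered in": 0.6, "capital of": 0.2}


def toy_log_prob(prompt: str, relation: str) -> float:
    return math.log(TOY_PROBS[relation])


chain_a = [("Ada", "employer", "Globex"), ("Globex", "headquartered in", "Shelbyville")]
chain_b = [("Ada", "born in", "Springfield"), ("Springfield", "capital of", "Freedonia")]
ranked = rank_chains("In which city is Ada's employer headquartered?",
                     [chain_b, chain_a], toy_log_prob)
best_score, best_chain = ranked[0]
```

Passing the scorer in as a function keeps the ranking logic independent of whichever auxiliary model supplies the probabilities.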
Key Findings
MI-based retrieval substantially raises multi-hop retrieval precision compared to embedding or probability baselines.
Pruning retrieved facts by measuring the target LLM's prediction entropy reduces noise and raises edited accuracy.
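A minimal sketch of the pruning step, assuming the filter keeps the prefix of the retrieved chain at which the target model's answer distribution has the lowest Shannon entropy. The `AnswerDist` interface and the toy distributions are hypothetical; a real version would read the distribution from the target LLM's output probabilities.

```python
import math
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical interface: (question, facts in context) -> {answer: probability}.
AnswerDist = Callable[[str, List[Triple]], Dict[str, float]]


def shannon_entropy(probs: Dict[str, float]) -> float:
    """Entropy in nats of a discrete distribution; zero-probability entries are skipped."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def prune_by_entropy(question: str, facts: List[Triple],
                     answer_dist: AnswerDist) -> List[Triple]:
    """Keep the prefix of `facts` after which the target model is least uncertain."""
    best_k = 0
    best_h = shannon_entropy(answer_dist(question, []))
    for k in range(1, len(facts) + 1):
        h = shannon_entropy(answer_dist(question, facts[:k]))
        if h < best_h:
            best_k, best_h = k, h
    return facts[:best_k]


# Toy stand-in for the target model: uncertainty drops for the first two
# facts, then rises again when an irrelevant third fact is added.
TOY_DISTS = [
    {"A": 0.5, "B": 0.5},
    {"A": 0.7, "B": 0.3},
    {"A": 0.95, "B": 0.05},
    {"A": 0.6, "B": 0.4},
]


def toy_answer_dist(question: str, facts: List[Triple]) -> Dict[str, float]:
    return TOY_DISTS[len(facts)]


facts = [
    ("Ada", "employer", "Globex"),
    ("Globex", "headquartered in", "Shelbyville"),
    ("Shelbyville", "twinned with", "Ogdenville"),
]
kept = prune_by_entropy("In which city is Ada's employer headquartered?",
                        facts, toy_answer_dist)
```

In the toy run the third fact raises entropy again, so only the first two are kept. This also shows the over-pruning risk: if the model's probabilities are poorly calibrated, the minimum can land too early.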
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | RAE 62.8% | Subgraph Retriever 21.9%; Fine-tune 3.8% | +40.9 pp vs Subgraph Retriever | MQuAKE-CF | Table 2, GPT-2 rows | Table 2 |
| Accuracy | RAE 69.3% | Subgraph Retriever 36.2% | +33.1 pp vs Subgraph Retriever | MQuAKE-CF | Table 2, GPT-J rows | Table 2 |
What To Try In 7 Days
Store edited facts in a small knowledge graph (Wikidata-style) rather than only embeddings.
Implement MI-scoring using a small autoregressive model (GPT-2/GPT-J) to rank KG paths for a target question.
Add an entropy-based filter: keep facts that lower the target LLM's output entropy before sending prompts to production APIs.
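As a starting point for the first step above, a small in-memory fact store might look like the sketch below. `FactGraph`, `upsert`, and `paths_from` are hypothetical names, and the store makes a simplifying assumption that each (head, relation) pair has a single tail, so an edit overwrites the old value. The paths it enumerates are the candidates that MI scoring would then rank.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


class FactGraph:
    """Tiny in-memory store of edited facts as (head, relation, tail) triples."""

    def __init__(self) -> None:
        # head -> {relation: tail}; single-valued relations are an assumption.
        self._out: Dict[str, Dict[str, str]] = defaultdict(dict)

    def upsert(self, head: str, relation: str, tail: str) -> None:
        """Insert a fact, or overwrite the old tail if (head, relation) exists."""
        self._out[head][relation] = tail

    def paths_from(self, start: str, max_hops: int) -> List[List[Triple]]:
        """Enumerate every chain of 1..max_hops triples that starts at `start`."""
        results: List[List[Triple]] = []

        def walk(node: str, chain: List[Triple]) -> None:
            if chain:
                results.append(list(chain))
            if len(chain) == max_hops:
                return
            for relation, tail in self._out.get(node, {}).items():
                # Skip entities already on the path to avoid cycles.
                if tail == start or any(tail == t for _, _, t in chain):
                    continue
                chain.append((node, relation, tail))
                walk(tail, chain)
                chain.pop()

        walk(start, [])
        return results


# Toy, fictional facts; the second upsert for Ada models an edit.
kg = FactGraph()
kg.upsert("Ada", "employer", "Acme")
kg.upsert("Acme", "headquartered in", "Springfield")
kg.upsert("Ada", "employer", "Globex")
kg.upsert("Globex", "headquartered in", "Shelbyville")

candidate_paths = kg.paths_from("Ada", max_hops=2)
```

After the edit, every path from `Ada` runs through `Globex`; the stale `Acme` facts remain stored but are no longer reachable from the question's entity.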
Risks & Boundaries
Limitations
Requires a KG that links edited facts; performance depends on KG coverage and quality.
MI scoring relies on the auxiliary autoregressive LLM's reasoning; a weak auxiliary model limits retrieval quality.
When Not To Use
When you lack a connected fact graph or cannot build one from your edits.
When edits are free-form textual changes not expressible as triplets (head, relation, tail).
Failure Modes
Auxiliary LLM biases: MI may favor frequent relations and miss rare but correct paths.
Over-pruning: entropy filter could drop needed facts if the target LLM is miscalibrated.

