Use mutual-information retrieval + entropy pruning to edit LLMs for multi-hop QA

March 28, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The idea is simple to implement: use a small LLM to rank knowledge-graph (KG) paths by mutual information (MI), prune the retrieved facts by output entropy, then prompt the target LLM with what remains. Experiments across several models and APIs support its effectiveness.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Yucheng Shi, Qiaoyu Tan, Xuansheng Wu, Shaochen Zhong, Kaixiong Zhou, Ninghao Liu

Why It Matters For Business

RAE lets you update LLM answers on multi-step questions quickly and cheaply by changing context instead of model weights, reducing API cost and avoiding retraining.

Who Should Care

Summary TLDR

The paper presents RAE, a retrieval-driven model-editing pipeline for multi-hop question answering. RAE stores edited facts in a knowledge graph, retrieves a subgraph by maximizing the mutual information (MI) between a candidate fact chain and the input question, estimated with an auxiliary next-token LLM, and then prunes redundant facts using the target LLM's output entropy before in-context editing. On the MQuAKE multi-hop editing benchmarks, across several models of 7B parameters or fewer, RAE improves edited-answer accuracy over embedding- and probability-based baselines; with GPT-2 on MQuAKE-CF it reaches 62.8% versus 21.9% for Subgraph Retriever. The code is public.

Problem Statement

LLMs struggle to integrate updated facts into multi-hop answers because conventional retrieval (embedding similarity or naive question decomposition) misses relevant facts whose entities differ from the query. This leads to wrong or outdated multi-step reasoning. The goal is to reliably fetch the correct edited fact chain for a question so in-context editing updates the model's answer without weight updates.

Main Contribution

RAE framework: combines a knowledge graph of edited facts with mutual-information-driven retrieval and entropy-based pruning, then applies in-context editing.

A practical MI decomposition that uses next-token probabilities from an auxiliary autoregressive LLM to score candidate relations along fact chains.
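The decomposition can be sketched from standard information-theoretic identities. The notation below is an assumption introduced for illustration and may differ from the paper's: q is the question, G = (h_1, r_1, h_2, ..., r_n, h_{n+1}) is a candidate fact chain of entities h_i and relations r_i, and p_theta is the auxiliary LLM's next-token distribution. If H(G) is treated as roughly constant across candidates, maximizing I(Q; G) amounts to minimizing H(G | Q), and for a single question a pointwise surrogate is to prefer chains with a high log-probability given q:

```latex
\begin{align}
I(Q; G) &= H(G) - H(G \mid Q), \\
\log p(G \mid q) &= \sum_{i=1}^{n} \Big[ \log p(r_i \mid q, h_1, r_1, \dots, h_i)
                   + \log p(h_{i+1} \mid q, h_1, r_1, \dots, h_i, r_i) \Big], \\
\mathrm{score}(G; q) &\approx \sum_{i=1}^{n} \log p_\theta(r_i \mid q, h_1, r_1, \dots, h_i).
\end{align}
```

The last line assumes the knowledge graph determines the next entity from (h_i, r_i), so the entity term contributes nothing and only relation-level next-token scores remain.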

Key Findings

MI-based retrieval substantially raises multi-hop retrieval precision compared to embedding or probability baselines.

Numbers: P@1 up to 84.0% (RAE Llama2) vs 78.3% (SR Llama2) and 52.7% (embedding, 2-hop)

Practical Use: Use MI-scoring over KG paths to find multi-hop edited facts instead of off-the-shelf embedding search for much better hit rates on required chains.

Evidence Ref: Table 3: retrieval precision (MQuAKE-CF)
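A minimal sketch of MI-style path ranking, not the authors' implementation: a beam search walks the KG from the question's entity and scores each path by the sum of per-relation log-probabilities. The names `rank_paths`, `toy_relation_logprob`, and the toy graph are all hypothetical; in practice the scorer would be replaced by next-token log-probabilities from an auxiliary autoregressive LLM.

```python
import math
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]           # (head, relation, tail)
KG = Dict[str, List[Tuple[str, str]]]   # head -> [(relation, tail), ...]
Scorer = Callable[[str, List[Triple], str], float]


def rank_paths(question: str, start: str, kg: KG, relation_logprob: Scorer,
               max_hops: int = 2, beam_width: int = 3) -> List[Tuple[float, List[Triple]]]:
    """Beam search over KG paths; a path's score is the sum of its relation log-probs."""
    beams: List[Tuple[float, List[Triple]]] = [(0.0, [])]
    for _ in range(max_hops):
        expanded = []
        for score, path in beams:
            head = path[-1][2] if path else start
            for relation, tail in kg.get(head, []):
                step = relation_logprob(question, path, relation)
                expanded.append((score + step, path + [(head, relation, tail)]))
        if not expanded:          # no path can be extended any further
            break
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams


def toy_relation_logprob(question: str, path: List[Triple], relation: str) -> float:
    """Hypothetical stand-in for an auxiliary LM: favors relations named in the question."""
    words = set(question.lower().replace("?", "").split())
    overlap = sum(1 for word in relation.split("_") if word in words)
    return math.log((1 + overlap) / 10.0)


# Toy graph with fictional entities (illustrative data only).
toy_kg: KG = {
    "Acme Corp": [("ceo", "Bob"), ("founded_in", "Porto")],
    "Bob": [("born_in", "Lisbon"), ("spouse", "Carol")],
}
ranked = rank_paths("Where was the ceo of Acme Corp born?", "Acme Corp",
                    toy_kg, toy_relation_logprob)
best_score, best_path = ranked[0]
print(best_path)
```

This sketch drops paths that cannot be extended while others can, which keeps all surviving paths the same length; a fuller version might keep finished shorter paths as candidates too.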

Pruning retrieved facts by measuring the target LLM's prediction entropy reduces noise and raises edited accuracy.

Numbers: Average accuracy gain ≈ 14.5% across models; GPT-2 4-hop gain = 24.0%

Practical Use: After retrieval, run a cheap entropy-based filter and drop facts that increase output uncertainty before in-context editing.

Evidence Ref: Table 4: w/o vs w/ pruning gains on MQuAKE-CF
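One possible reading of the entropy filter, sketched under stated assumptions: try each prefix of the ranked facts, measure the Shannon entropy of the target LLM's answer distribution, and keep the prefix with the lowest entropy. `prune_by_entropy` and `toy_answer_distribution` are hypothetical names, and the toy probabilities are made up for illustration; a real version would read answer probabilities from the target model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_by_entropy(question: str, ranked_facts: List[Triple],
                     answer_distribution: Callable[[str, List[Triple]], Dict[str, float]]
                     ) -> List[Triple]:
    """Keep the prefix of ranked facts that minimizes the model's output entropy."""
    best_k, best_entropy = 0, float("inf")
    for k in range(1, len(ranked_facts) + 1):
        dist = answer_distribution(question, ranked_facts[:k])
        entropy = shannon_entropy(list(dist.values()))
        if entropy < best_entropy:
            best_k, best_entropy = k, entropy
    return ranked_facts[:best_k]


# Made-up answer distributions keyed by how many facts are in the prompt.
TOY_DISTS = {
    1: {"Lisbon": 0.5, "Porto": 0.5},   # one fact: model still unsure
    2: {"Lisbon": 0.9, "Porto": 0.1},   # two facts: most confident
    3: {"Lisbon": 0.6, "Porto": 0.4},   # extra fact adds noise
}


def toy_answer_distribution(question: str, facts: List[Triple]) -> Dict[str, float]:
    """Hypothetical stand-in for querying the target LLM."""
    return TOY_DISTS[len(facts)]


facts = [
    ("Acme Corp", "ceo", "Bob"),
    ("Bob", "born_in", "Lisbon"),
    ("Acme Corp", "founded_in", "Porto"),
]
kept = prune_by_entropy("Where was the ceo of Acme Corp born?", facts,
                        toy_answer_distribution)
print(kept)
```

Each prefix costs one call to the target model, so the number of candidate facts bounds the extra cost of the filter.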

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | RAE 62.8%; Subgraph Retriever 21.9%; Fine-tune 3.8% | Subgraph Retriever | +40.9 pp vs SR | MQuAKE-CF | Table 2, GPT-2 rows | Table 2 |
| Accuracy | RAE 69.3%; Subgraph Retriever 36.2% | Subgraph Retriever | +33.1 pp | MQuAKE-CF | Table 2, GPT-J rows | Table 2 |

What To Try In 7 Days

Store edited facts in a small knowledge graph (Wikidata-style) rather than only embeddings.

Implement MI-scoring using a small autoregressive model (GPT-2/GPT-J) to rank KG paths for a target question.

Add an entropy-based filter: keep facts that lower the target LLM's output entropy before sending prompts to production APIs.
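The first step above can be sketched as a tiny in-memory triple store whose edits overwrite existing facts, plus a prompt builder for in-context editing. This is an illustration under assumptions, not the RAE codebase: `EditedFactGraph`, `build_prompt`, the prompt template, and the fictional facts are all hypothetical, and the store assumes one tail per (head, relation) pair.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


class EditedFactGraph:
    """Minimal triple store; assumes a single tail per (head, relation)."""

    def __init__(self) -> None:
        self._edges: Dict[Tuple[str, str], str] = {}

    def upsert(self, head: str, relation: str, tail: str) -> None:
        """Insert a fact, or overwrite the tail if (head, relation) already exists."""
        self._edges[(head, relation)] = tail

    def neighbors(self, head: str) -> List[Tuple[str, str]]:
        """Return all (relation, tail) pairs leaving the given head entity."""
        return [(r, t) for (h, r), t in self._edges.items() if h == head]

    def follow(self, start: str, relations: List[str]) -> List[Triple]:
        """Walk a relation sequence from start; stop early if a hop is missing."""
        chain: List[Triple] = []
        head = start
        for relation in relations:
            tail = self._edges.get((head, relation))
            if tail is None:
                break
            chain.append((head, relation, tail))
            head = tail
        return chain


def build_prompt(question: str, facts: List[Triple]) -> str:
    """Format retrieved facts as context ahead of the question (hypothetical template)."""
    lines = [f"- {h} | {r} | {t}" for h, r, t in facts]
    return "Given the following facts:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


graph = EditedFactGraph()
graph.upsert("Acme Corp", "ceo", "Alice")   # original fact (fictional)
graph.upsert("Bob", "born_in", "Lisbon")
graph.upsert("Acme Corp", "ceo", "Bob")     # the edit replaces Alice with Bob

chain = graph.follow("Acme Corp", ["ceo", "born_in"])
prompt = build_prompt("Where was the CEO of Acme Corp born?", chain)
print(prompt)
```

Because the edit lives in the graph rather than in model weights, reverting it is a single `upsert` call with the old tail.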

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/sycny/RAE (implementation and data prep scripts)

Wikidata (external KG)

MQuAKE datasets (from prior work referenced)

Risks & Boundaries

Limitations

Requires a KG that links edited facts; performance depends on KG coverage and quality.

MI scoring relies on the auxiliary autoregressive LLM's reasoning; a weak auxiliary model limits retrieval quality.

When Not To Use

When you lack a connected fact graph or cannot build one from your edits.

When edits are free-form textual changes not expressible as triplets (head, relation, tail).

Failure Modes

Auxiliary LLM biases: MI may favor frequent relations and miss rare but correct paths.

Over-pruning: entropy filter could drop needed facts if the target LLM is miscalibrated.

Core Entities

Models

GPT-2 (1.5B), GPT-J (6B), Falcon (7B), Vicuna (7B), Llama2-chat (7B), GPT-3.5, GPT-4, gpt-babbage-002, GPT-3.5-turbo-0613

Metrics

Accuracy, Precision@K (P@K), Output entropy (Shannon entropy), Editing API cost

Datasets

MQuAKE-CF, MQuAKE-T, MQuAKE-CF-9k, Popular dataset (from prior work)

Benchmarks

MQuAKE multi-hop editing benchmark (MQuAKE-CF / MQuAKE-T)