Use mutual-information retrieval + entropy pruning to edit LLMs for multi-hop QA

Overview

Decision Snapshot: Ready For Pilot

The idea is simple and implementable: use a small LLM to rank KG paths by MI, prune by entropy, then prompt the target LLM; experiments across models and APIs support its effectiveness.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Yucheng Shi, Qiaoyu Tan, Xuansheng Wu, Shaochen Zhong, Kaixiong Zhou, Ninghao Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAE lets you update LLM answers on multi-step questions quickly and cheaply by changing context instead of model weights, reducing API cost and avoiding retraining.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The paper presents RAE, a retrieval-driven model-editing pipeline for multi-hop question answering. RAE stores edited facts in a knowledge graph, retrieves the subgraph that maximizes mutual information (MI) between a candidate fact chain and the input question (scored with an auxiliary next-token LLM), then prunes redundant facts using the target LLM's output entropy before in-context editing. On multi-hop editing benchmarks (MQuAKE variants) and several models of 7B parameters or fewer, RAE improves edited-answer accuracy over embedding- and probability-based baselines; with GPT-2 on MQUAKE-CF, for example, accuracy is 62.8% versus 21.9% for Subgraph Retriever (Table 2). The code is public.

Problem Statement

LLMs struggle to integrate updated facts into multi-hop answers because conventional retrieval (embedding similarity or naive question decomposition) misses relevant facts whose entities differ from the query. This leads to wrong or outdated multi-step reasoning. The goal is to reliably fetch the correct edited fact chain for a question so in-context editing updates the model's answer without weight updates.

Main Contribution

RAE framework: combine a knowledge-graph of edited facts with mutual-information-driven retrieval and entropy-based pruning, then apply in-context editing.

A practical MI decomposition that uses next-token probabilities from an auxiliary autoregressive LLM to score candidate relations along fact chains.
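The scoring idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that a fact chain is scored by summing the auxiliary model's conditional log-probabilities of each relation given the question and the chain so far (the summary does not give the exact decomposition). The `LogProbFn` interface, the prompt layout, and `toy_log_prob` are hypothetical stand-ins; in practice `log_prob` would wrap a real next-token model such as GPT-2.

```python
import math
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical interface: returns log p(continuation | prompt) from an
# auxiliary autoregressive LM. Any next-token model could sit behind it.
LogProbFn = Callable[[str, str], float]


def score_chain(question: str, chain: List[Triple], log_prob: LogProbFn) -> float:
    """Sum log p(relation | question, facts so far, head) over the chain."""
    context = f"Question: {question}\nFacts:"
    total = 0.0
    for head, relation, tail in chain:
        # Condition on the question, all earlier facts, and the current head.
        total += log_prob(f"{context} {head}", f" {relation}")
        # Append the completed fact so later hops are conditioned on it.
        context = f"{context} {head} {relation} {tail}."
    return total


def rank_chains(question: str, chains: List[List[Triple]],
                log_prob: LogProbFn) -> List[List[Triple]]:
    """Return chains sorted from highest to lowest score."""
    return sorted(chains, key=lambda c: score_chain(question, c, log_prob),
                  reverse=True)


def toy_log_prob(prompt: str, continuation: str) -> float:
    """Toy stand-in for a real LM (assumption): favors continuations
    whose words already occur somewhere in the prompt."""
    words = continuation.lower().split()
    hits = sum(w in prompt.lower() for w in words)
    return math.log((1 + hits) / (2 + len(words)))


# Placeholder entities; only the relation wording matters to the toy scorer.
question = "What is the country of citizenship of the author of Book X?"
chain_a = [("Book X", "author", "Author Y"),
           ("Author Y", "country of citizenship", "Country Z")]
chain_b = [("Book X", "publisher", "Press P"),
           ("Press P", "founded by", "Person Q")]

best = rank_chains(question, [chain_b, chain_a], toy_log_prob)[0]
```

With the toy scorer, `best` is `chain_a`, because its relations match the wording of the question. Swapping `toy_log_prob` for a function backed by a real model is the only change needed to try the idea on actual KG paths.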

Key Findings

MI-based retrieval substantially raises multi-hop retrieval precision compared to embedding or probability baselines.

Numbers: P@1 up to 84.0% (RAE, Llama2) vs 78.3% (Subgraph Retriever (SR), Llama2) and 52.7% (embedding, 2-hop)

Practical Use: Score KG paths with MI to find multi-hop edited facts instead of relying on off-the-shelf embedding search; the hit rate on required chains is much higher.

Evidence Ref: Table 3, retrieval precision (MQUAKE-CF)
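For readers who want to reproduce a P@K number on their own retrieval runs, here is a minimal sketch using the standard definition of precision at K (fraction of the top-K retrieved items that are relevant). The summary does not state how the paper computes P@K, so treat the definition and the function names as assumptions.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k retrieved items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked[:k])
    return sum(item in relevant for item in top_k) / k


def mean_precision_at_k(runs: List[Tuple[Sequence[Hashable], Set[Hashable]]],
                        k: int) -> float:
    """Average precision@k over (ranked list, relevant set) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Toy example: two questions, each with a ranked list of candidate fact ids.
runs = [
    (["f1", "f7", "f3"], {"f1", "f3"}),  # top-1 is relevant
    (["f9", "f2", "f4"], {"f2"}),        # top-1 is not relevant
]
p_at_1 = mean_precision_at_k(runs, k=1)
```

In this toy run `p_at_1` is 0.5: the first question's top result is relevant and the second's is not.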

Pruning retrieved facts by measuring the target LLM's prediction entropy reduces noise and raises edited accuracy.

Numbers: Average accuracy gain ≈ 14.5% across models; GPT-2 4-hop gain = 24.0%

Practical Use: After retrieval, run a cheap entropy-based filter and drop facts that increase output uncertainty before in-context editing.

Evidence Ref: Table 4, accuracy without vs with pruning on MQUAKE-CF
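One way to read the pruning step is as a search over how many of the ranked facts to keep: compute the Shannon entropy of the target LLM's answer distribution for each prefix and keep the prefix with the lowest entropy. The sketch below follows that reading. The prefix-based search, the `AnswerDistFn` interface, and `toy_answer_dist` are assumptions made for illustration; the summary only says that facts are pruned using the target model's output entropy.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: maps (question, facts placed in the prompt) to the
# target LLM's probability distribution over candidate answers.
AnswerDistFn = Callable[[str, Sequence[str]], Dict[str, float]]


def shannon_entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def prune_by_entropy(question: str, ranked_facts: Sequence[str],
                     answer_dist: AnswerDistFn) -> List[str]:
    """Keep the prefix of ranked facts that yields the lowest output entropy."""
    best_k, best_h = 0, float("inf")
    for k in range(1, len(ranked_facts) + 1):
        h = shannon_entropy(answer_dist(question, ranked_facts[:k]))
        if h < best_h:  # strict: ties keep the shorter prefix
            best_k, best_h = k, h
    return list(ranked_facts[:best_k])


def toy_answer_dist(question: str, facts: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in for the target LLM (assumption): two facts sharpen the
    answer distribution, a third distracting fact flattens it again."""
    table = {
        1: {"A": 0.5, "B": 0.5},
        2: {"A": 0.9, "B": 0.1},
        3: {"A": 0.6, "B": 0.4},
    }
    return table[len(facts)]


facts = ["fact 1", "fact 2", "distractor"]
kept = prune_by_entropy("toy question", facts, toy_answer_dist)
```

Here `kept` holds the first two facts, since adding the distractor raises the entropy again. Each prefix costs one call to the target model, so the number of retrieved facts bounds the extra API cost of this filter.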

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	RAE 62.8% | Subgraph Retriever 21.9% | Fine-tune 3.8%	Subgraph Retriever	+40.9 pp vs SR	MQUAKE-CF	Table 2, GPT-2 rows	Table 2
Accuracy	RAE 69.3% | Subgraph Retriever 36.2%	Subgraph Retriever	+33.1 pp vs SR	MQUAKE-CF	Table 2, GPT-J rows	Table 2

What To Try In 7 Days

Store edited facts in a small knowledge graph (Wikidata-style) rather than only embeddings.

Implement MI-scoring using a small autoregressive model (GPT-2/GPT-J) to rank KG paths for a target question.

Add an entropy-based filter: keep facts that lower the target LLM's output entropy before sending prompts to production APIs.
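As a starting point for the first step, the sketch below stores edited facts as (head, relation, tail) triples and enumerates candidate multi-hop chains from a starting entity. The class name `EditedFactGraph`, its methods, and the placeholder entities are hypothetical; a production system would more likely sit on top of an existing graph store or a Wikidata dump.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


class EditedFactGraph:
    """Tiny in-memory KG of edited facts (illustrative only)."""

    def __init__(self) -> None:
        # head -> relation -> tail
        self._edges: Dict[str, Dict[str, str]] = defaultdict(dict)

    def add_edit(self, head: str, relation: str, tail: str) -> None:
        # A later edit for the same (head, relation) replaces the old tail.
        self._edges[head][relation] = tail

    def chains(self, start: str, max_hops: int) -> Iterator[List[Triple]]:
        """Yield every fact chain of 1..max_hops hops that begins at `start`."""
        stack: List[Tuple[str, List[Triple]]] = [(start, [])]
        while stack:
            node, path = stack.pop()
            if path:
                yield path
            if len(path) < max_hops:
                visited = {start} | {t for _, _, t in path}
                for relation, tail in self._edges.get(node, {}).items():
                    if tail not in visited:  # skip cycles
                        stack.append((tail, path + [(node, relation, tail)]))


# Placeholder entities standing in for real edits.
graph = EditedFactGraph()
graph.add_edit("Book X", "author", "Author Y")
graph.add_edit("Author Y", "country of citizenship", "Country Z")
graph.add_edit("Country Z", "head of state", "Person Q")

two_hop_candidates = list(graph.chains("Book X", max_hops=2))
```

The enumerated chains are the candidates that an MI-style scorer would rank and an entropy filter would then prune. Exhaustive enumeration is only practical for small graphs or small hop counts; on larger graphs a beam search guided by the scorer is the usual alternative.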

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/sycny/RAE

Data URLs

https://github.com/sycny/RAE (implementation and data prep scripts)

Wikidata (external KG)

MQuAKE datasets (from prior work referenced)

Risks & Boundaries

Limitations

Requires a KG that links edited facts; performance depends on KG coverage and quality.

MI scoring relies on the auxiliary autoregressive LLM's reasoning; a weak auxiliary model limits retrieval quality.

When Not To Use

When you lack a connected fact graph or cannot build one from your edits.

When edits are free-form textual changes not expressible as triplets (head, relation, tail).

Failure Modes

Auxiliary LLM biases: MI may favor frequent relations and miss rare but correct paths.

Over-pruning: entropy filter could drop needed facts if the target LLM is miscalibrated.

Core Entities

Models

GPT-2 (1.5B), GPT-J (6B), Falcon (7B), Vicuna (7B), Llama2-chat (7B), GPT-3.5, GPT-4, gpt-babbage-002, GPT-3.5-turbo-0613

Metrics

Accuracy, Precision@K (P@K), Output entropy (Shannon entropy), Editing API cost

Datasets

MQUAKE-CF, MQUAKE-T, MQUAKE-CF-9k, Popular dataset (from prior work)

Benchmarks

MQuAKE multi-hop editing benchmark (MQUAKE-CF / MQUAKE-T)
