Use mutual-information retrieval + entropy pruning to edit LLMs for multi-hop QA

March 28, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Yucheng Shi, Qiaoyu Tan, Xuansheng Wu, Shaochen Zhong, Kaixiong Zhou, Ninghao Liu

Links

Abstract / PDF

Why It Matters For Business

RAE lets you update LLM answers on multi-step questions quickly and cheaply by changing context instead of model weights, reducing API cost and avoiding retraining.

Summary TLDR

The paper presents RAE: a retrieval-driven model-editing pipeline for multi-hop question answering. RAE stores edited facts in a knowledge graph, retrieves subgraphs by maximizing mutual information (MI) between a candidate fact chain and the input question using an auxiliary next-token LLM, then prunes redundant facts via the target LLM's output entropy before in-context editing. On multi-hop editing benchmarks (MQuAKE variants) and several 7B-and-smaller models, RAE greatly improves edited answer accuracy vs embedding- and probability-based baselines. The code is public.

Problem Statement

LLMs struggle to integrate updated facts into multi-hop answers because conventional retrieval (embedding similarity or naive question decomposition) misses relevant facts whose entities differ from the query. This leads to wrong or outdated multi-step reasoning. The goal is to reliably fetch the correct edited fact chain for a question so in-context editing updates the model's answer without weight updates.
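To make the retrieval gap concrete, here is a minimal sketch of edited facts stored as (head, relation, tail) triplets. The class name `EditedKG`, its methods, and the placeholder entities (`BookX`, `AuthorA`, `AuthorB`, `CountryA`, `CountryB`) are hypothetical and chosen only for illustration; the sketch also assumes one tail per (head, relation) pair, which a real knowledge graph would not guarantee.

```python
class EditedKG:
    """Toy store of edited facts keyed by (head, relation)."""

    def __init__(self):
        self.facts = {}  # (head, relation) -> tail

    def add_fact(self, head, relation, tail):
        # Re-adding an existing (head, relation) key overwrites the tail.
        # That overwrite is how an edit is modeled in this sketch.
        self.facts[(head, relation)] = tail

    def follow(self, start, relations):
        """Walk a chain of relations from `start` and return the triplets visited."""
        chain, entity = [], start
        for relation in relations:
            tail = self.facts.get((entity, relation))
            if tail is None:
                break  # broken chain: the KG has no fact for this hop
            chain.append((entity, relation, tail))
            entity = tail
        return chain


kg = EditedKG()
kg.add_fact("BookX", "author", "AuthorA")
kg.add_fact("AuthorA", "citizenship", "CountryA")
kg.add_fact("AuthorB", "citizenship", "CountryB")
kg.add_fact("BookX", "author", "AuthorB")  # the edit: BookX now has a new author

question = "What is the country of citizenship of the author of BookX?"
chain = kg.follow("BookX", ["author", "citizenship"])
print(chain)
```

The second-hop fact starts from `AuthorB`, an entity that never appears in the question, which is the kind of fact that similarity between the question and individual facts tends to miss.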

Main Contribution

RAE framework: combine a knowledge-graph of edited facts with mutual-information-driven retrieval and entropy-based pruning, then apply in-context editing.

A practical MI decomposition that uses next-token probabilities from an auxiliary autoregressive LLM to score candidate relations along fact chains.

A pruning rule that uses the edited model's output entropy to remove redundant or irrelevant retrieved facts.

Extensive experiments on MQuAKE-CF, MQuAKE-T, and Popular datasets across multiple open and proprietary LLMs, plus theoretical justification linking MI maximization to in-context learning activation.

Public code release: https://github.com/sycny/RAE.
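The MI decomposition named in the contributions above can be sketched as follows. The first line is the standard definition of mutual information and the second is the exact chain rule; the final approximation, which scores each hop by the probability of its relation under the auxiliary model, is an assumption about how the scoring is organized. The symbols ($q$ for the question, $G_S$ for the retrieved fact chain, $(h_i, r_i, t_i)$ for its triplets, $p_\theta$ for the auxiliary LLM) are chosen here for illustration and may not match the paper's notation.

```latex
\begin{align}
I(Q; G_S) &= H(G_S) - H(G_S \mid Q) \\
\log p(G_S \mid q) &= \sum_{i=1}^{n} \log p\big((h_i, r_i, t_i) \mid q,\ (h_j, r_j, t_j)_{j<i}\big) \\
&\approx \sum_{i=1}^{n} \log p_\theta\big(r_i \mid q,\ h_i,\ r_{<i}\big)
\end{align}
```

If $H(G_S)$ is treated as fixed, maximizing the mutual information amounts to minimizing $H(G_S \mid Q)$, which favors chains whose relations the auxiliary model finds likely given the question.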

Key Findings

MI-based retrieval substantially raises multi-hop retrieval precision compared to embedding or probability baselines.

Numbers: P@1 up to 84.0% (RAE Llama2) vs 78.3% (SR Llama2) and 52.7% (embedding, 2-hop)

Pruning retrieved facts by measuring the target LLM's prediction entropy reduces noise and raises edited accuracy.

Numbers: Average accuracy gain ≈ 14.5% across models; GPT-2 4-hop gain = 24.0%

RAE yields much higher edited-answer accuracy than prior editing methods on multi-hop tasks for smaller models.

Numbers: Example: GPT-2 (M-CF) RAE 62.8% vs Subgraph Retriever 21.9% and fine-tune 3.8%

RAE edits proprietary models effectively and cheaply using a small open LLM for retrieval.

Numbers: RAE raised GPT-4 edited accuracy by ~20 percentage points while using ~15% of the API cost of a comparative method (MELl
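The entropy-based pruning described in the findings above can be sketched as below. This is one plausible reading, not the paper's exact rule: it tries every prefix of the retrieved fact chain and keeps the one that gives the target model the lowest output entropy. `answer_dist_fn` is a hypothetical callable standing in for a call to the target LLM, and the toy distributions are invented for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def prune_by_entropy(facts, answer_dist_fn):
    """Return the prefix of `facts` with the lowest output entropy.

    answer_dist_fn(prefix) is a hypothetical stand-in for the target LLM:
    it returns the model's output distribution when the prompt contains
    the question plus the facts in `prefix`.
    """
    best_k, best_h = 0, float("inf")
    for k in range(len(facts) + 1):
        h = shannon_entropy(answer_dist_fn(facts[:k]))
        if h < best_h:  # strict comparison: ties keep the shorter prefix
            best_k, best_h = k, h
    return facts[:best_k], best_h


# Invented distributions indexed by how many facts are in the prompt.
# The third fact is noise: adding it makes the model less certain again.
TOY_DISTS = {
    0: [0.25, 0.25, 0.25, 0.25],
    1: [0.70, 0.10, 0.10, 0.10],
    2: [0.97, 0.01, 0.01, 0.01],
    3: [0.60, 0.20, 0.10, 0.10],
}
facts = ["fact_1", "fact_2", "noisy_fact_3"]
kept, entropy = prune_by_entropy(facts, lambda prefix: TOY_DISTS[len(prefix)])
print(kept, round(entropy, 3))
```

Low entropy signals confidence, not correctness, so a rule like this inherits the miscalibration risk of the target model.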

Results

Accuracy (GPT-2, MQuAKE-CF)

Value: RAE 62.8% | Subgraph Retriever 21.9% | Fine-tune 3.8%

Baseline: Subgraph Retriever

Accuracy

Value: RAE 69.3% | Subgraph Retriever 36.2%

Baseline: Subgraph Retriever

Retrieval precision P@1 (2-hop, MQuAKE-CF)

Value: RAE (Llama2) 82.7% | SR (Llama2) 78.3% | Embedding (KG Link) 52.7%

Baseline: SR (Llama2)

Accuracy gain from pruning

Value: Average gain ≈ 14.5% across models

Baseline: RAE without pruning

What To Try In 7 Days

Store edited facts in a small knowledge graph (Wikidata-style) rather than only embeddings.

Implement MI-scoring using a small autoregressive model (GPT-2/GPT-J) to rank KG paths for a target question.

Add an entropy-based filter: keep facts that lower the target LLM's output entropy before sending prompts to production APIs.
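As a starting point for the MI-scoring step above, the sketch below ranks knowledge-graph paths by the summed log-probability of their relations. Everything named here is an assumption: `KG` and `TOY_PROBS` are invented toy data, and `relation_logprob` is a stub that would be replaced by next-token log-probabilities from a small autoregressive model prompted with the question and the current head entity.

```python
import math

# Toy graph: head -> list of (relation, tail). Invented for illustration.
KG = {
    "BookX": [("author", "AuthorB"), ("publisher", "PressP")],
    "AuthorB": [("citizenship", "CountryB"), ("spouse", "PersonS")],
}

# Invented probabilities standing in for an auxiliary LM's output.
TOY_PROBS = {
    ("BookX", "author"): 0.8,
    ("BookX", "publisher"): 0.2,
    ("AuthorB", "citizenship"): 0.9,
    ("AuthorB", "spouse"): 0.1,
}


def relation_logprob(question, head, relation):
    """Stub scorer. A real version would condition on `question`;
    this one ignores it and looks up a fixed table."""
    return math.log(TOY_PROBS[(head, relation)])


def rank_paths(question, start, hops):
    """Enumerate all `hops`-step paths from `start`, best score first.

    Each path is (triplets, final_entity, summed_log_probability).
    Paths that dead-end before `hops` steps are dropped.
    """
    paths = [([], start, 0.0)]
    for _ in range(hops):
        extended = []
        for triplets, entity, score in paths:
            for relation, tail in KG.get(entity, []):
                step = relation_logprob(question, entity, relation)
                extended.append(
                    (triplets + [(entity, relation, tail)], tail, score + step)
                )
        paths = extended
    return sorted(paths, key=lambda p: p[2], reverse=True)


question = "What is the country of citizenship of the author of BookX?"
ranked = rank_paths(question, "BookX", hops=2)
best_triplets, best_entity, best_score = ranked[0]
print(best_triplets, best_entity)
```

Exhaustive enumeration is only workable on a small graph; on a larger one, keeping the top few partial paths at each hop (a beam) is a common way to bound the cost.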

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a KG that links edited facts; performance depends on KG coverage and quality.
  • MI scoring relies on the auxiliary autoregressive LLM's reasoning; weak retrievers limit results.
  • Entropy pruning assumes lower entropy equals correct fact coverage, which may fail if the model is overconfident on wrong facts.

When Not To Use

  • When you lack a connected fact graph or cannot build one from your edits.
  • When edits are free-form textual changes not expressible as triplets (head, relation, tail).
  • When the edit must be made in the model's parameters themselves (e.g., downstream use requires the model's internals to reflect the edit).

Failure Modes

  • Auxiliary LLM biases: MI may favor frequent relations and miss rare but correct paths.
  • Over-pruning: entropy filter could drop needed facts if the target LLM is miscalibrated.
  • KG mismatch: edits not injected into the external KG lead to broken chains and low recall.

Core Entities

Models

  • GPT-2 (1.5B)
  • GPT-J (6B)
  • Falcon (7B)
  • Vicuna (7B)
  • Llama2-chat (7B)
  • GPT-3.5
  • GPT-4
  • gpt-babbage-002
  • GPT-3.5-turbo-0613

Metrics

  • Accuracy
  • Precision@K (P@K)
  • Output entropy (Shannon entropy)
  • Editing API cost
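For reference, Precision@K from the list above can be computed as in this minimal sketch. It uses the common definition (the fraction of the top K retrieved items that are relevant), which may differ in detail from the paper's evaluation protocol; the function names and fact identifiers are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def mean_precision_at_k(examples, k):
    """Average P@K over (ranked_list, relevant_set) pairs, one per question."""
    return sum(precision_at_k(r, rel, k) for r, rel in examples) / len(examples)


examples = [
    (["f1", "f3", "f2"], {"f1", "f2"}),  # top-1 is relevant
    (["f9", "f4"], {"f4"}),              # top-1 is not relevant
]
p_at_1 = mean_precision_at_k(examples, k=1)
p_at_2 = mean_precision_at_k(examples, k=2)
print(p_at_1, p_at_2)
```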

Datasets

  • MQuAKE-CF
  • MQuAKE-T
  • MQuAKE-CF-9k
  • Popular dataset (from prior work)

Benchmarks

  • MQuAKE multi-hop editing benchmark (MQuAKE-CF / MQuAKE-T)